PROJECT DEFENSE - DATAGONG x 10.000 Codeurs¶
By Abdou-Raouf ATARMLA & Corneille HUEHA
0. PRESENTATION¶
Topic: Predicting the winning political party in the 2020 US presidential election from socio-demographic data
Work plan:
- Setup (installing and importing the required libraries)
- Data import
- Preparing and assembling the working dataset
- Exploratory analysis
- Modeling
- Evaluation
- Conclusion
1. Setup¶
1.1. INSTALLING THE DEPENDENCIES¶
We need to install the following libraries:
- pandas
- numpy
- scikit-learn
- xlrd
- openpyxl
- matplotlib
- seaborn
- shap
- xgboost
To do so, we can use the following command:
pip install -r requirements.txt
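Judging from the install log below, requirements.txt contains the following eleven packages, with only numpy pinned to a version (this is a reconstruction from the log, not the file itself):
pandas
numpy==2.1.0
scikit-learn
xlrd
openpyxl
matplotlib
seaborn
shap
xgboost
imbalanced-learn
nbconvert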
# Command to install the dependencies
!pip install -r requirements.txt
Requirement already satisfied (output truncated): pandas 2.2.3, numpy 2.1.0, scikit-learn 1.6.1, xlrd 2.0.1, openpyxl 3.1.5, matplotlib 3.10.0, seaborn 0.13.2, shap 0.46.0, xgboost 2.1.4, imbalanced-learn 0.13.0, nbconvert 7.16.6, plus their transitive dependencies.
1.2. IMPORTING THE LIBRARIES¶
import warnings
warnings.filterwarnings("ignore")
# Core libraries
import numpy as np  # Numerical array manipulation and mathematical computations
import pandas as pd  # Data handling and manipulation as DataFrames
# Data visualization
import seaborn as sns  # Advanced statistical plots
import matplotlib.pyplot as plt  # Basic visualizations (histograms, scatter plots, etc.)
# Data preparation and train/test split
from sklearn.model_selection import train_test_split  # Split the data into training and test sets
from sklearn.preprocessing import OneHotEncoder  # Encode categorical variables as numeric ones
# Baseline model (logistic regression)
from sklearn.linear_model import LogisticRegression  # Logistic regression model for classification
# Model evaluation
from sklearn.metrics import classification_report  # Generate a model performance report
# Handling class imbalance
from sklearn.utils import resample  # Undersample the majority class (undersampling)
from imblearn.over_sampling import SMOTE  # Oversample the minority class (oversampling)
# Advanced models
from sklearn.ensemble import RandomForestClassifier  # Random forest model for classification
from xgboost import XGBClassifier  # XGBoost model for optimized classification
# Model tuning
from sklearn.model_selection import GridSearchCV  # Hyperparameter search with cross-validation
from imblearn.pipeline import Pipeline as ImbPipeline  # Pipeline combining preprocessing and the model
# Model interpretability
import shap  # Explain model predictions with SHAP (SHapley Additive exPlanations)
# Progress tracking
from tqdm import tqdm  # Progress bars to monitor long-running loops
import matplotlib.gridspec as gridspec
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from sklearn.pipeline import Pipeline
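Since the imports above bring in SMOTE together with the imblearn Pipeline, here is a minimal sketch of how they are meant to fit together (illustrative only, on synthetic data; this is not the project's final model): placing SMOTE inside an ImbPipeline guarantees that oversampling is applied only when fitting, never when scoring validation data.
# Minimal sketch on synthetic data (assumption: for illustration only, not the project's model).
from sklearn.datasets import make_classification
X_demo, y_demo = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)  # imbalanced toy data
demo_pipe = ImbPipeline(steps=[
    ('scaler', StandardScaler()),          # standardize the features
    ('smote', SMOTE(random_state=0)),      # oversample the minority class during fit only
    ('clf', LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(X_demo, y_demo)
print(classification_report(y_demo, demo_pipe.predict(X_demo)))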
2. DATA IMPORT¶
# County-level results of the 2020 presidential election
elections_2020 = pd.read_csv('./data/2020_US_County_Level_Presidential_Results.csv')
# County-level results of the 2008-2016 presidential elections
# This file is only used for the exploratory analysis (trend comparison)
elections_08_16 = pd.read_csv('./data/US_County_Level_Presidential_Results_08-16.csv')
# Demographic data: population estimates per county
population = pd.read_excel('./data/PopulationEstimates.xls', engine='xlrd', header=2)
# Education data: educational attainment levels per county
education = pd.read_excel('./data/Education.xls', engine='xlrd', header=4)
# Poverty data: poverty rates per county
poverty = pd.read_excel('./data/PovertyEstimates.xls', engine='xlrd', header=4)
# Unemployment data: unemployment rates per county
unemployment = pd.read_excel('./data/Unemployment.xls', engine='xlrd', header=4)
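Before merging anything, a quick optional sanity check confirms that each file loaded with a plausible shape; the header offsets passed to read_excel above are easy to get wrong:
# Optional sanity check: dimensions of each raw dataset.
for name, d in {'elections_2020': elections_2020, 'elections_08_16': elections_08_16,
                'population': population, 'education': education,
                'poverty': poverty, 'unemployment': unemployment}.items():
    print(f"{name}: {d.shape[0]} rows x {d.shape[1]} columns")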
3. DATA PREPARATION AND ASSEMBLY¶
3.1. Step 1¶
# Harmonize the columns to ensure consistency across datasets
# Select and rename columns for the 2020 election results
df_2020 = elections_2020[['county_fips', 'county_name', 'state_name']].rename(
columns={'county_fips': 'fips', 'county_name': 'county_name', 'state_name': 'state_name'}
)
# Select and rename columns for the 2008-2016 election results
df_08_16 = elections_08_16[['fips_code', 'county']].rename(
columns={'fips_code': 'fips', 'county': 'county_name'}
)
# Select and rename columns for the population data
df_population = population[['FIPStxt', 'Area_Name', 'State']].rename(
columns={'FIPStxt': 'fips', 'Area_Name': 'county_name', 'State': 'state_code'}
)
# Select and rename columns for the education data
df_education = education[['FIPS Code', 'Area name', 'State']].rename(
columns={'FIPS Code': 'fips', 'Area name': 'county_name', 'State': 'state_code'}
)
# Select and rename columns for the poverty data
df_poverty = poverty[['FIPStxt', 'Area_name', 'Stabr']].rename(
columns={'FIPStxt': 'fips', 'Area_name': 'county_name', 'Stabr': 'state_code'}
)
# Select and rename columns for the unemployment data
df_unemployment = unemployment[['fips_txt', 'area_name', 'Stabr']].rename(
columns={'fips_txt': 'fips', 'area_name': 'county_name', 'Stabr': 'state_code'}
)
# Concatenate all datasets into a single DataFrame
checkpoint_0_raw = pd.concat([
df_2020,
df_08_16,
df_population,
df_education,
df_poverty,
df_unemployment
], ignore_index=True)
checkpoint_0_raw
| fips | county_name | state_name | state_code | |
|---|---|---|---|---|
| 0 | 1001 | Autauga County | Alabama | NaN |
| 1 | 1003 | Baldwin County | Alabama | NaN |
| 2 | 1005 | Barbour County | Alabama | NaN |
| 3 | 1007 | Bibb County | Alabama | NaN |
| 4 | 1009 | Blount County | Alabama | NaN |
| ... | ... | ... | ... | ... |
| 19283 | 72145 | Vega Baja Municipio, PR | NaN | PR |
| 19284 | 72147 | Vieques Municipio, PR | NaN | PR |
| 19285 | 72149 | Villalba Municipio, PR | NaN | PR |
| 19286 | 72151 | Yabucoa Municipio, PR | NaN | PR |
| 19287 | 72153 | Yauco Municipio, PR | NaN | PR |
19288 rows × 4 columns
# Pad the FIPS code to 5 characters (add leading zeros where needed)
checkpoint_0_raw['fips'] = checkpoint_0_raw['fips'].astype(str).str.zfill(5)
checkpoint_0_raw
| fips | county_name | state_name | state_code | |
|---|---|---|---|---|
| 0 | 01001 | Autauga County | Alabama | NaN |
| 1 | 01003 | Baldwin County | Alabama | NaN |
| 2 | 01005 | Barbour County | Alabama | NaN |
| 3 | 01007 | Bibb County | Alabama | NaN |
| 4 | 01009 | Blount County | Alabama | NaN |
| ... | ... | ... | ... | ... |
| 19283 | 72145 | Vega Baja Municipio, PR | NaN | PR |
| 19284 | 72147 | Vieques Municipio, PR | NaN | PR |
| 19285 | 72149 | Villalba Municipio, PR | NaN | PR |
| 19286 | 72151 | Yabucoa Municipio, PR | NaN | PR |
| 19287 | 72153 | Yauco Municipio, PR | NaN | PR |
19288 rows × 4 columns
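The padding matters because pandas often parses FIPS codes as integers, silently dropping the leading zero of states like Alabama (01); a quick illustration:
# Illustrative: an integer FIPS code loses its leading zero until zfill restores it.
print(str(1001).zfill(5))  # -> '01001' (Autauga County, AL)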
# Drop duplicates based on FIPS (the unique county identifier)
checkpoint_0 = checkpoint_0_raw.drop_duplicates(subset=['fips']).sort_values(by='fips').reset_index(drop=True)
# Inspect the resulting DataFrame
checkpoint_0
| fips | county_name | state_name | state_code | |
|---|---|---|---|---|
| 0 | 00000 | United States | NaN | US |
| 1 | 01000 | Alabama | NaN | AL |
| 2 | 01001 | Autauga County | Alabama | NaN |
| 3 | 01003 | Baldwin County | Alabama | NaN |
| 4 | 01005 | Barbour County | Alabama | NaN |
| ... | ... | ... | ... | ... |
| 3319 | 72145 | Vega Baja Municipio, Puerto Rico | NaN | PR |
| 3320 | 72147 | Vieques Municipio, Puerto Rico | NaN | PR |
| 3321 | 72149 | Villalba Municipio, Puerto Rico | NaN | PR |
| 3322 | 72151 | Yabucoa Municipio, Puerto Rico | NaN | PR |
| 3323 | 72153 | Yauco Municipio, Puerto Rico | NaN | PR |
3324 rows × 4 columns
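Note that drop_duplicates(subset=['fips']) keeps the first occurrence of each FIPS code by default, so the concatenation order above (election files first, then the demographic files) implicitly defines the priority of the sources. A toy illustration (hypothetical rows):
# Illustrative: with the default keep='first', the earlier source wins.
toy = pd.DataFrame({'fips': ['01001', '01001'], 'src': ['elections', 'population']})
print(toy.drop_duplicates(subset=['fips']))  # keeps only the 'elections' row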
# Save the harmonized DataFrame to an Excel file
checkpoint_0.to_excel('checkpoints/save_0.xlsx', index=False)
print("\n✅ Harmonized data saved to 'checkpoints/save_0.xlsx'.")
✅ Harmonized data saved to 'checkpoints/save_0.xlsx'.
3.2. Step 2¶
# Load the previously harmonized file
ch0 = pd.read_excel('checkpoints/save_0.xlsx')
# Sources used to fill in missing values
sources = [df_2020, df_08_16, df_population, df_education, df_poverty, df_unemployment]
# Columns to fill when values are missing
columns_to_fill = ['county_name', 'state_code', 'state_name']
# Create any of these columns that are missing from `ch0`
for col in columns_to_fill:
    if col not in ch0.columns:
        ch0[col] = pd.NA  # Initialize with missing values
# Function that fills missing values using the other data sources
def fill_missing_values(base_df, sources, columns_to_fill):
    """
    Fill the missing values of a DataFrame using other data sources.
    - base_df: main DataFrame containing missing values
    - sources: list of source DataFrames
    - columns_to_fill: list of columns to fill
    """
    for source in sources:
        for col in columns_to_fill:
            if col in source.columns:  # Check that the column exists in the source
                base_df[col] = base_df[col].fillna(
                    base_df['fips'].map(source.set_index('fips')[col])  # Fill based on the FIPS match
                )
    return base_df
# Apply the function to fill in the missing values
checkpoint_1_raw = fill_missing_values(ch0, sources, columns_to_fill)
# Check the remaining missing values after filling
ch1_missing_data = checkpoint_1_raw[columns_to_fill].isnull().sum()
print(f"Missing values after filling:\n{ch1_missing_data}")
Missing values after filling:
county_name      0
state_code      41
state_name     172
dtype: int64
# Save the completed file
checkpoint_1_raw.to_excel('checkpoints/save_1.xlsx', index=False)
print("\n✅ Completed data saved to 'checkpoints/save_1.xlsx'.")
✅ Completed data saved to 'checkpoints/save_1.xlsx'.
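The fill in this step works by turning each source into a FIPS-indexed lookup table and mapping it onto the base frame. A minimal self-contained illustration (toy data, not the project files; it assumes each fips appears at most once per source, otherwise .map would raise on the duplicated index):
# Toy illustration of the map-based fill used above.
base = pd.DataFrame({'fips': ['01001', '01003'], 'state_code': [pd.NA, 'AL']})
src = pd.DataFrame({'fips': ['01001', '01003'], 'state_code': ['AL', 'AL']})
base['state_code'] = base['state_code'].fillna(base['fips'].map(src.set_index('fips')['state_code']))
print(base)  # the missing state_code of 01001 is now filled from src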
3.3. Step 3¶
df_2020 = elections_2020[['county_fips', 'county_name', 'state_name']].rename(
columns={'county_fips': 'county_code', 'county_name': 'county_name', 'state_name': 'state_name'})
df_08_16 = elections_08_16[['fips_code', 'county']].rename(
columns={'fips_code': 'county_code', 'county': 'county_name'})
df_population = population[['FIPStxt', 'Area_Name', 'State']].rename(
columns={'FIPStxt': 'county_code', 'Area_Name': 'county_name', 'State': 'state_code'})
df_education = education[['FIPS Code', 'Area name', 'State']].rename(
columns={'FIPS Code': 'county_code', 'Area name': 'county_name', 'State': 'state_code'})
df_poverty = poverty[['FIPStxt', 'Area_name', 'Stabr']].rename(
columns={'FIPStxt': 'county_code', 'Area_name': 'county_name', 'Stabr': 'state_code'})
df_unemployment = unemployment[['fips_txt', 'area_name', 'Stabr']].rename(
columns={'fips_txt': 'county_code', 'area_name': 'county_name', 'Stabr': 'state_code'})
# Sources, in priority order
sources = [df_2020, df_08_16, df_population, df_education, df_poverty, df_unemployment]
# Load the previously harmonized file
ch1 = pd.read_excel('checkpoints/save_1.xlsx')
# Rename the key column to `county_code`
ch1.rename(columns={'fips': 'county_code'}, inplace=True)
# Make sure `county_code` is formatted as 5 digits
ch1['county_code'] = ch1['county_code'].astype(str).str.zfill(5)
# Add any missing columns
columns_to_fill = ['county_name', 'state_code', 'state_name']
for col in columns_to_fill:
    if col not in ch1.columns:
        ch1[col] = pd.NA
# Function to fill in the missing values (keyed on county_code this time)
def fill_missing_values(base_df, sources, columns_to_fill):
    for source in sources:
        # Make sure county_code has the right format in the sources
        source['county_code'] = source['county_code'].astype(str).str.zfill(5)
        for col in columns_to_fill:
            if col in source.columns:  # Check that the column exists in the source
                base_df[col] = base_df[col].fillna(
                    base_df['county_code'].map(source.set_index('county_code')[col])  # Fill using the county_code key
                )
    return base_df
# Fill in the missing columns
checkpoint_2_raw = fill_missing_values(ch1, sources, columns_to_fill)
# Check the remaining missing values
ch2_missing_data = checkpoint_2_raw[columns_to_fill].isnull().sum()
print(f"Missing values after filling:\n{ch2_missing_data}")
Missing values after filling:
county_name      0
state_code      41
state_name     172
dtype: int64
# Save the completed data
checkpoint_2_raw.to_excel('checkpoints/save_2.xlsx', index=False)
print("\nCompleted data saved to 'checkpoints/save_2.xlsx'.")
Completed data saved to 'checkpoints/save_2.xlsx'.
3.4. Step 4¶
elections_2020 = elections_2020[['county_fips', 'per_gop', 'per_dem']].rename(columns={'county_fips': 'county_code'})
elections_08_16 = elections_08_16[['fips_code', 'total_2016', 'dem_2016', 'gop_2016']].rename(columns={'fips_code': 'county_code'})
population = population[['FIPStxt', 'Rural-urban_Continuum Code_2013', 'Urban_Influence_Code_2013']].rename(
columns={'FIPStxt': 'county_code', 'Rural-urban_Continuum Code_2013': 'rural_urban_code',
'Urban_Influence_Code_2013': 'urban_influence_code'})
education = education[['FIPS Code', 'Percent of adults with less than a high school diploma, 2015-19',
'Percent of adults with a high school diploma only, 2015-19',
'Percent of adults completing some college or associate\'s degree, 2015-19',
'Percent of adults with a bachelor\'s degree or higher, 2015-19']].rename(
columns={'FIPS Code': 'county_code',
'Percent of adults with less than a high school diploma, 2015-19': 'percent_no_highschool',
'Percent of adults with a high school diploma only, 2015-19': 'percent_highschool',
'Percent of adults completing some college or associate\'s degree, 2015-19': 'percent_college',
'Percent of adults with a bachelor\'s degree or higher, 2015-19': 'percent_bachelor'})
poverty = poverty[['FIPStxt', 'PCTPOVALL_2019', 'MEDHHINC_2019']].rename(
columns={'FIPStxt': 'county_code', 'PCTPOVALL_2019': 'percent_poverty',
'MEDHHINC_2019': 'median_household_income'})
unemployment = unemployment[['fips_txt', 'Unemployment_rate_2019', 'Employed_2019', 'Unemployed_2019']].rename(
columns={'fips_txt': 'county_code', 'Unemployment_rate_2019': 'unemployment_rate'})
# Standardize the `county_code` format (5 characters)
datasets = [elections_2020, elections_08_16, population, education, poverty, unemployment]
for df in datasets:
    df['county_code'] = df['county_code'].astype(str).str.zfill(5)
datamap = pd.read_excel('checkpoints/save_2.xlsx')
datamap['county_code'] = datamap['county_code'].astype(str).str.zfill(5)
# Relevant columns to add from each source
columns_to_add = {
'elections_2020': ['per_gop', 'per_dem'],
'elections_08_16': ['total_2016', 'dem_2016', 'gop_2016'],
'population': ['rural_urban_code', 'urban_influence_code'],
'education': ['percent_no_highschool', 'percent_highschool', 'percent_college', 'percent_bachelor'],
'poverty': ['percent_poverty', 'median_household_income'],
'unemployment': ['unemployment_rate', 'Employed_2019', 'Unemployed_2019']
}
# Fill the missing columns in the datamap file
for source, cols in zip(datasets, columns_to_add.values()):
    for col in cols:
        if col not in datamap.columns:
            datamap[col] = pd.NA
        datamap[col] = datamap[col].fillna(datamap['county_code'].map(source.set_index('county_code')[col]))
print(datamap)
# Save the enriched file
datamap.to_excel('checkpoints/save_3.xlsx', index=False)
print("\nFile 'checkpoints/save_3.xlsx' saved with all data integrated.")
county_code county_name state_name state_code \
0 00000 United States NaN US
1 01000 Alabama NaN AL
2 01001 Autauga County Alabama AL
3 01003 Baldwin County Alabama AL
4 01005 Barbour County Alabama AL
... ... ... ... ...
3319 72145 Vega Baja Municipio, Puerto Rico NaN PR
3320 72147 Vieques Municipio, Puerto Rico NaN PR
3321 72149 Villalba Municipio, Puerto Rico NaN PR
3322 72151 Yabucoa Municipio, Puerto Rico NaN PR
3323 72153 Yauco Municipio, Puerto Rico NaN PR
per_gop per_dem total_2016 dem_2016 gop_2016 rural_urban_code \
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 0.714368 0.270184 24661.0 5908.0 18110.0 2.0
3 0.761714 0.224090 94090.0 18409.0 72780.0 3.0
4 0.534512 0.457882 10390.0 4848.0 5431.0 6.0
... ... ... ... ... ... ...
3319 NaN NaN NaN NaN NaN 1.0
3320 NaN NaN NaN NaN NaN 7.0
3321 NaN NaN NaN NaN NaN 2.0
3322 NaN NaN NaN NaN NaN 1.0
3323 NaN NaN NaN NaN NaN 2.0
urban_influence_code percent_no_highschool percent_highschool \
0 NaN 11.998918 26.956844
1 NaN 13.819302 30.800268
2 2.0 11.483395 33.588459
3 2.0 9.193843 27.659616
4 6.0 26.786907 35.604542
... ... ... ...
3319 1.0 28.428238 26.225822
3320 12.0 28.773281 39.177906
3321 2.0 21.993263 38.366028
3322 1.0 29.048897 25.715004
3323 2.0 26.556698 33.272095
percent_college percent_bachelor percent_poverty \
0 28.898697 32.145542 12.3
1 29.912098 25.468332 15.6
2 28.356571 26.571573 12.1
3 31.284081 31.862459 10.1
4 26.029837 11.578713 27.1
... ... ... ...
3319 24.123638 21.222300 NaN
3320 14.049454 17.999357 NaN
3321 19.727892 19.912819 NaN
3322 27.233078 18.003019 NaN
3323 15.529844 24.641363 NaN
median_household_income unemployment_rate Employed_2019 \
0 65712.0 3.669409 157115247.0
1 51771.0 3.000000 2174483.0
2 58233.0 2.700000 25458.0
3 59871.0 2.700000 94675.0
4 35972.0 3.800000 8213.0
... ... ... ...
3319 NaN 9.600000 11791.0
3320 NaN 6.900000 2406.0
3321 NaN 15.900000 6231.0
3322 NaN 13.100000 7552.0
3323 NaN 14.600000 8331.0
Unemployed_2019
0 5984808.0
1 67264.0
2 714.0
3 2653.0
4 324.0
... ...
3319 1246.0
3320 179.0
3321 1175.0
3322 1139.0
3323 1428.0
[3324 rows x 20 columns]
File 'checkpoints/save_3.xlsx' saved with all data integrated.
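The same enrichment can also be written with DataFrame.merge, which some readers may find more familiar than the map/fillna idiom; a hedged sketch, equivalent here under the assumption that county_code is unique within each source:
# Alternative sketch: left-merge each source on county_code.
# Assumes county_code is unique within each source DataFrame.
enriched = datamap[['county_code', 'county_name', 'state_name', 'state_code']].copy()
for source, cols in zip(datasets, columns_to_add.values()):
    enriched = enriched.merge(source[['county_code'] + cols], on='county_code', how='left')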
# Check the missing values
missing_data = datamap.isnull().sum()
print("Missing values per column:\n", missing_data)
# Percentage of missing values
missing_percentage = (missing_data / len(datamap)) * 100
print("\nPercentage of missing values per column:\n", missing_percentage)
# Check for duplicates
duplicate_count = datamap.duplicated().sum()
print(f"Number of duplicate rows in the DataFrame: {duplicate_count}")
# Descriptive statistics
stats = datamap.describe()
print("Descriptive statistics:\n", stats)
Missing values per column:
county_code                  0
county_name                  0
state_name                 172
state_code                  41
per_gop                    172
per_dem                    172
total_2016                 212
dem_2016                   212
gop_2016                   212
rural_urban_code           104
urban_influence_code       104
percent_no_highschool       51
percent_highschool          51
percent_college             51
percent_bachelor            51
percent_poverty            131
median_household_income    131
unemployment_rate           52
Employed_2019               52
Unemployed_2019             52
dtype: int64

Percentage of missing values per column:
county_code                0.000000
county_name                0.000000
state_name                 5.174489
state_code                 1.233454
per_gop                    5.174489
per_dem                    5.174489
total_2016                 6.377858
dem_2016                   6.377858
gop_2016                   6.377858
rural_urban_code           3.128761
urban_influence_code       3.128761
percent_no_highschool      1.534296
percent_highschool         1.534296
percent_college            1.534296
percent_bachelor           1.534296
percent_poverty            3.941035
median_household_income    3.941035
unemployment_rate          1.564380
Employed_2019              1.564380
Unemployed_2019            1.564380
dtype: float64

Number of duplicate rows in the DataFrame: 0
Descriptive statistics:
per_gop per_dem total_2016 dem_2016 gop_2016 \
count 3152.000000 3152.000000 3.112000e+03 3.112000e+03 3112.000000
mean 0.647805 0.333851 4.089631e+04 1.956104e+04 19343.762211
std 0.162014 0.159852 1.082522e+05 6.847899e+04 39125.598644
min 0.053973 0.030909 6.400000e+01 4.000000e+00 57.000000
25% 0.554128 0.209978 4.815000e+03 1.164750e+03 3206.000000
50% 0.681720 0.300235 1.092950e+04 3.140000e+03 7113.000000
75% 0.773776 0.425830 2.866450e+04 9.535250e+03 17391.750000
max 0.961818 0.921497 2.314275e+06 1.654626e+06 590465.000000
rural_urban_code urban_influence_code percent_no_highschool \
count 3220.000000 3220.000000 3273.000000
mean 4.937888 5.188820 13.330532
std 2.724344 3.506848 6.545762
min 1.000000 1.000000 1.116910
25% 2.000000 2.000000 8.540109
50% 6.000000 5.000000 11.884497
75% 7.000000 8.000000 17.020765
max 9.000000 12.000000 73.560211
percent_highschool percent_college percent_bachelor percent_poverty \
count 3273.000000 3273.000000 3273.000000 3193.000000
mean 33.956041 30.587866 22.125561 14.417946
std 7.212828 5.340745 9.536379 5.769337
min 7.265136 5.235602 0.000000 2.700000
25% 29.369493 27.022669 15.511985 10.400000
50% 34.249691 30.628832 19.776859 13.400000
75% 38.947308 34.079266 26.348045 17.400000
max 57.433674 60.563381 77.557411 47.700000
median_household_income unemployment_rate Employed_2019 \
count 3193.000000 3272.000000 3.272000e+03
mean 55874.761979 4.139630 1.446578e+05
std 14493.345229 1.785734 2.808116e+06
min 24732.000000 0.700000 2.120000e+02
25% 46309.000000 3.000000 4.778750e+03
50% 53505.000000 3.700000 1.129300e+04
75% 62327.000000 4.700000 3.218725e+04
max 151806.000000 19.300000 1.571152e+08
Unemployed_2019
count 3.272000e+03
mean 5.541794e+03
std 1.071326e+05
min 4.000000e+00
25% 2.030000e+02
50% 4.990000e+02
75% 1.328500e+03
max 5.984808e+06
NOTE¶
Here we can see that the rows missing per_gop, per_dem, total_2016, dem_2016, gop_2016, rural_urban_code and urban_influence_code are the regional (state-level) entries.
For example, Alaska (02000) groups its counties under 02***, and Alabama (01000) groups its counties under 01***.
# Initial step: identify the states (their FIPS codes end in '000')
datamap['is_state'] = datamap['county_code'].apply(lambda x: 1 if x.endswith('000') else 0)
datamap_complete = datamap.copy()
# Drop the row with county_code = '00000' (United States)
datamap_complete = datamap_complete[datamap_complete['county_code'] != '00000']
print(f"Row with county_code '00000' (United States) dropped. Remaining rows: {len(datamap_complete)}")
# Update state_name for the state rows
datamap_complete.loc[datamap_complete['is_state'] == 1, 'state_name'] = datamap_complete['county_name']
# Display the resulting DataFrame
print(datamap_complete)
# Save the enriched file
datamap_complete.to_excel('checkpoints/save_4.xlsx', index=False)
print("\nFile 'checkpoints/save_4.xlsx' saved with all data integrated.")
# Add state_prefix (first two digits of the FIPS code)
datamap_complete['state_prefix'] = datamap_complete['county_code'].str[:2]
# Aggregate per_gop, per_dem, etc. over each state's counties
state_agg = datamap_complete[datamap_complete['is_state'] == 0].groupby('state_prefix').agg({
    'per_gop': 'mean',   # unweighted mean across counties
    'per_dem': 'mean',   # unweighted mean across counties
    'gop_2016': 'sum',
    'dem_2016': 'sum',
    'total_2016': 'sum'
}).reset_index()
# Update the state rows with the aggregated values
for index, row in state_agg.iterrows():
    state_prefix = row['state_prefix']
    state_county_code = state_prefix + '000'
    datamap_complete.loc[(datamap_complete['is_state'] == 1) & (datamap_complete['county_code'] == state_county_code), 'per_gop'] = row['per_gop']
    datamap_complete.loc[(datamap_complete['is_state'] == 1) & (datamap_complete['county_code'] == state_county_code), 'per_dem'] = row['per_dem']
    datamap_complete.loc[(datamap_complete['is_state'] == 1) & (datamap_complete['county_code'] == state_county_code), 'total_2016'] = row['total_2016']
    datamap_complete.loc[(datamap_complete['is_state'] == 1) & (datamap_complete['county_code'] == state_county_code), 'gop_2016'] = row['gop_2016']
    datamap_complete.loc[(datamap_complete['is_state'] == 1) & (datamap_complete['county_code'] == state_county_code), 'dem_2016'] = row['dem_2016']
# New step: compute the distribution of rural_urban_code per state
ruc_dist = datamap_complete[datamap_complete['is_state'] == 0].groupby('state_prefix')['rural_urban_code'].value_counts(normalize=True).unstack(fill_value=0)
ruc_dist.columns = [f'ruc_{int(col)}' for col in ruc_dist.columns]  # Rename columns: ruc_1, ruc_2, etc.
# If urban_influence_code is present (future-proofing)
if 'urban_influence_code' in datamap_complete.columns:
    uic_dist = datamap_complete[datamap_complete['is_state'] == 0].groupby('state_prefix')['urban_influence_code'].value_counts(normalize=True).unstack(fill_value=0)
    uic_dist.columns = [f'uic_{int(col)}' for col in uic_dist.columns]  # Rename: uic_1, uic_2, etc.
else:
    print("Note: 'urban_influence_code' is not in the dataset. Only the RUCC distributions will be computed.")
# Join the distributions onto the state rows
df_states = datamap_complete[datamap_complete['is_state'] == 1].set_index('state_prefix')
df_states = df_states.join(ruc_dist, how='left')
if 'urban_influence_code' in datamap_complete.columns:
    df_states = df_states.join(uic_dist, how='left')
# Reassemble the full DataFrame
datamap_complete = pd.concat([datamap_complete[datamap_complete['is_state'] == 0], df_states.reset_index()], ignore_index=True)
# Fill NaN with 0 in the new columns
for col in ruc_dist.columns:
    datamap_complete[col] = datamap_complete[col].fillna(0)
if 'urban_influence_code' in datamap_complete.columns:
    for col in uic_dist.columns:
        datamap_complete[col] = datamap_complete[col].fillna(0)
# Display the relevant columns, including the new RUCC distributions
print(datamap_complete[['county_code', 'county_name', 'state_name', 'state_code', 'per_gop', 'per_dem', 'total_2016', 'gop_2016', 'dem_2016', 'is_state'] + [col for col in datamap_complete.columns if col.startswith('ruc_')]])
# Save the enriched file
datamap_complete.to_excel('checkpoints/save_5.xlsx', index=False)
print("\nFile 'checkpoints/save_5.xlsx' saved with all data integrated, including the RUCC distributions.")
Row with county_code '00000' (United States) dropped. Remaining rows: 3323
county_code county_name state_name state_code \
1 01000 Alabama Alabama AL
2 01001 Autauga County Alabama AL
3 01003 Baldwin County Alabama AL
4 01005 Barbour County Alabama AL
5 01007 Bibb County Alabama AL
... ... ... ... ...
3319 72145 Vega Baja Municipio, Puerto Rico NaN PR
3320 72147 Vieques Municipio, Puerto Rico NaN PR
3321 72149 Villalba Municipio, Puerto Rico NaN PR
3322 72151 Yabucoa Municipio, Puerto Rico NaN PR
3323 72153 Yauco Municipio, Puerto Rico NaN PR
per_gop per_dem total_2016 dem_2016 gop_2016 rural_urban_code \
1 NaN NaN NaN NaN NaN NaN
2 0.714368 0.270184 24661.0 5908.0 18110.0 2.0
3 0.761714 0.224090 94090.0 18409.0 72780.0 3.0
4 0.534512 0.457882 10390.0 4848.0 5431.0 6.0
5 0.784263 0.206983 8748.0 1874.0 6733.0 1.0
... ... ... ... ... ... ...
3319 NaN NaN NaN NaN NaN 1.0
3320 NaN NaN NaN NaN NaN 7.0
3321 NaN NaN NaN NaN NaN 2.0
3322 NaN NaN NaN NaN NaN 1.0
3323 NaN NaN NaN NaN NaN 2.0
... percent_no_highschool percent_highschool percent_college \
1 ... 13.819302 30.800268 29.912098
2 ... 11.483395 33.588459 28.356571
3 ... 9.193843 27.659616 31.284081
4 ... 26.786907 35.604542 26.029837
5 ... 20.942602 44.878773 23.800098
... ... ... ... ...
3319 ... 28.428238 26.225822 24.123638
3320 ... 28.773281 39.177906 14.049454
3321 ... 21.993263 38.366028 19.727892
3322 ... 29.048897 25.715004 27.233078
3323 ... 26.556698 33.272095 15.529844
percent_bachelor percent_poverty median_household_income \
1 25.468332 15.6 51771.0
2 26.571573 12.1 58233.0
3 31.862459 10.1 59871.0
4 11.578713 27.1 35972.0
5 10.378526 20.3 47918.0
... ... ... ...
3319 21.222300 NaN NaN
3320 17.999357 NaN NaN
3321 19.912819 NaN NaN
3322 18.003019 NaN NaN
3323 24.641363 NaN NaN
unemployment_rate Employed_2019 Unemployed_2019 is_state
1 3.0 2174483.0 67264.0 1
2 2.7 25458.0 714.0 0
3 2.7 94675.0 2653.0 0
4 3.8 8213.0 324.0 0
5 3.1 8419.0 266.0 0
... ... ... ... ...
3319 9.6 11791.0 1246.0 0
3320 6.9 2406.0 179.0 0
3321 15.9 6231.0 1175.0 0
3322 13.1 7552.0 1139.0 0
3323 14.6 8331.0 1428.0 0
[3323 rows x 21 columns]
File 'checkpoints/save_4.xlsx' saved with all data integrated.
county_code county_name state_name state_code per_gop \
0 01001 Autauga County Alabama AL 0.714368
1 01003 Baldwin County Alabama AL 0.761714
2 01005 Barbour County Alabama AL 0.534512
3 01007 Bibb County Alabama AL 0.784263
4 01009 Blount County Alabama AL 0.895716
... ... ... ... ... ...
3318 53000 Washington Washington WA 0.520402
3319 54000 West Virginia West Virginia WV 0.741402
3320 55000 Wisconsin Wisconsin WI 0.564259
3321 56000 Wyoming Wyoming WY 0.750912
3322 72000 Puerto Rico Puerto Rico PR NaN
per_dem total_2016 gop_2016 dem_2016 is_state ruc_1 \
0 0.270184 24661.0 18110.0 5908.0 0 0.000000
1 0.224090 94090.0 72780.0 18409.0 0 0.000000
2 0.457882 10390.0 5431.0 4848.0 0 0.000000
3 0.206983 8748.0 6733.0 1874.0 0 0.000000
4 0.095694 25384.0 22808.0 2150.0 0 0.000000
... ... ... ... ... ... ...
3318 0.448434 2765627.0 1043648.0 1523720.0 1 0.128205
3319 0.243346 708226.0 486198.0 187457.0 1 0.018182
3320 0.419635 2944620.0 1409467.0 1382210.0 1 0.097222
3321 0.217684 248742.0 174248.0 55949.0 1 0.000000
3322 NaN 0.0 0.0 0.0 1 0.512821
ruc_2 ruc_3 ruc_4 ruc_5 ruc_6 ruc_7 ruc_8 \
0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
... ... ... ... ... ... ... ...
3318 0.179487 0.230769 0.153846 0.051282 0.102564 0.025641 0.076923
3319 0.090909 0.272727 0.036364 0.018182 0.236364 0.127273 0.127273
3320 0.111111 0.152778 0.097222 0.000000 0.277778 0.083333 0.111111
3321 0.000000 0.086957 0.043478 0.086957 0.043478 0.565217 0.000000
3322 0.205128 0.166667 0.038462 0.000000 0.051282 0.012821 0.000000
ruc_9
0 0.000000
1 0.000000
2 0.000000
3 0.000000
4 0.000000
... ...
3318 0.051282
3319 0.072727
3320 0.069444
3321 0.173913
3322 0.012821
[3323 rows x 19 columns]
File 'checkpoints/save_5.xlsx' saved with all data integrated, including the RUCC distributions.
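The value_counts(normalize=True).unstack(fill_value=0) idiom used above turns each state's county-level RUCC codes into one row of proportions; a toy illustration (hypothetical codes):
# Toy illustration of the groupby / value_counts / unstack idiom.
toy = pd.DataFrame({'state_prefix': ['01', '01', '01', '02'],
                    'rural_urban_code': [1.0, 1.0, 6.0, 7.0]})
dist = toy.groupby('state_prefix')['rural_urban_code'].value_counts(normalize=True).unstack(fill_value=0)
dist.columns = [f'ruc_{int(c)}' for c in dist.columns]
print(dist)  # state 01 -> ruc_1: 0.667, ruc_6: 0.333; state 02 -> ruc_7: 1.0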
# Keep only the states (is_state == 1)
datamap_states = datamap_complete[datamap_complete['is_state'] == 1].copy()
# Sanity check
print(f"Total number of observations: {len(datamap)}")
print(f"Number of states: {len(datamap_states)}")
print(datamap_states.head())
# Save the enriched file
datamap_states.to_excel('checkpoints/states_1.xlsx', index=False)
print("\nFile 'checkpoints/states_1.xlsx' saved with all data integrated.")
Total number of observations: 3324
Number of states: 52
county_code county_name state_name state_code per_gop per_dem \
3271 01000 Alabama Alabama AL 0.647359 0.342648
3272 02000 Alaska Alaska AK 0.497797 0.420912
3273 04000 Arizona Arizona AZ 0.548723 0.435861
3274 05000 Arkansas Arkansas AR 0.688531 0.282032
3275 06000 California California CA 0.439389 0.537068
total_2016 dem_2016 gop_2016 rural_urban_code ... uic_3 \
3271 2078165.0 718084.0 1306925.0 NaN ... 0.044776
3272 0.0 0.0 0.0 NaN ... 0.000000
3273 2062810.0 936250.0 1021154.0 NaN ... 0.066667
3274 1108615.0 378729.0 677904.0 NaN ... 0.040000
3275 9631972.0 5931283.0 3184721.0 NaN ... 0.017241
uic_4 uic_5 uic_6 uic_7 uic_8 uic_9 uic_10 \
3271 0.059701 0.104478 0.208955 0.029851 0.000000 0.000000 0.044776
3272 0.000000 0.000000 0.000000 0.034483 0.068966 0.000000 0.068966
3273 0.066667 0.133333 0.066667 0.000000 0.066667 0.066667 0.000000
3274 0.013333 0.053333 0.213333 0.040000 0.133333 0.146667 0.066667
3275 0.051724 0.068966 0.086207 0.034483 0.051724 0.000000 0.000000
uic_11 uic_12
3271 0.074627 0.000000
3272 0.344828 0.379310
3273 0.000000 0.000000
3274 0.026667 0.000000
3275 0.034483 0.017241
[5 rows x 43 columns]
File 'checkpoints/states_1.xlsx' saved with all data integrated.
to_drop_state_cols = ['rural_urban_code', 'urban_influence_code','is_state','state_prefix','county_name']
cleaned_state_df = datamap_states.drop(to_drop_state_cols, axis=1)
cleaned_state_df.rename(columns={'county_code': 'id'}, inplace=True)
cleaned_state_df.head()
| id | state_name | state_code | per_gop | per_dem | total_2016 | dem_2016 | gop_2016 | percent_no_highschool | percent_highschool | ... | uic_3 | uic_4 | uic_5 | uic_6 | uic_7 | uic_8 | uic_9 | uic_10 | uic_11 | uic_12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3271 | 01000 | Alabama | AL | 0.647359 | 0.342648 | 2078165.0 | 718084.0 | 1306925.0 | 13.819302 | 30.800268 | ... | 0.044776 | 0.059701 | 0.104478 | 0.208955 | 0.029851 | 0.000000 | 0.000000 | 0.044776 | 0.074627 | 0.000000 |
| 3272 | 02000 | Alaska | AK | 0.497797 | 0.420912 | 0.0 | 0.0 | 0.0 | 7.152934 | 28.003729 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.034483 | 0.068966 | 0.000000 | 0.068966 | 0.344828 | 0.379310 |
| 3273 | 04000 | Arizona | AZ | 0.548723 | 0.435861 | 2062810.0 | 936250.0 | 1021154.0 | 12.860705 | 23.858877 | ... | 0.066667 | 0.066667 | 0.133333 | 0.066667 | 0.000000 | 0.066667 | 0.066667 | 0.000000 | 0.000000 | 0.000000 |
| 3274 | 05000 | Arkansas | AR | 0.688531 | 0.282032 | 1108615.0 | 378729.0 | 677904.0 | 13.430243 | 34.034885 | ... | 0.040000 | 0.013333 | 0.053333 | 0.213333 | 0.040000 | 0.133333 | 0.146667 | 0.066667 | 0.026667 | 0.000000 |
| 3275 | 06000 | California | CA | 0.439389 | 0.537068 | 9631972.0 | 5931283.0 | 3184721.0 | 16.692171 | 20.487896 | ... | 0.017241 | 0.051724 | 0.068966 | 0.086207 | 0.034483 | 0.051724 | 0.000000 | 0.000000 | 0.034483 | 0.017241 |
5 rows × 38 columns
# Drop the row with id 72000 (Puerto Rico): it lacks the election results needed to derive the target variable
cleaned_state_df = cleaned_state_df[cleaned_state_df['id'] != '72000']
cleaned_state_df
| id | state_name | state_code | per_gop | per_dem | total_2016 | dem_2016 | gop_2016 | percent_no_highschool | percent_highschool | ... | uic_3 | uic_4 | uic_5 | uic_6 | uic_7 | uic_8 | uic_9 | uic_10 | uic_11 | uic_12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3271 | 01000 | Alabama | AL | 0.647359 | 0.342648 | 2078165.0 | 718084.0 | 1306925.0 | 13.819302 | 30.800268 | ... | 0.044776 | 0.059701 | 0.104478 | 0.208955 | 0.029851 | 0.000000 | 0.000000 | 0.044776 | 0.074627 | 0.000000 |
| 3272 | 02000 | Alaska | AK | 0.497797 | 0.420912 | 0.0 | 0.0 | 0.0 | 7.152934 | 28.003729 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.034483 | 0.068966 | 0.000000 | 0.068966 | 0.344828 | 0.379310 |
| 3273 | 04000 | Arizona | AZ | 0.548723 | 0.435861 | 2062810.0 | 936250.0 | 1021154.0 | 12.860705 | 23.858877 | ... | 0.066667 | 0.066667 | 0.133333 | 0.066667 | 0.000000 | 0.066667 | 0.066667 | 0.000000 | 0.000000 | 0.000000 |
| 3274 | 05000 | Arkansas | AR | 0.688531 | 0.282032 | 1108615.0 | 378729.0 | 677904.0 | 13.430243 | 34.034885 | ... | 0.040000 | 0.013333 | 0.053333 | 0.213333 | 0.040000 | 0.133333 | 0.146667 | 0.066667 | 0.026667 | 0.000000 |
| 3275 | 06000 | California | CA | 0.439389 | 0.537068 | 9631972.0 | 5931283.0 | 3184721.0 | 16.692171 | 20.487896 | ... | 0.017241 | 0.051724 | 0.068966 | 0.086207 | 0.034483 | 0.051724 | 0.000000 | 0.000000 | 0.034483 | 0.017241 |
| 3276 | 08000 | Colorado | CO | 0.559502 | 0.417248 | 2564185.0 | 1212209.0 | 1137455.0 | 8.253678 | 21.368059 | ... | 0.015625 | 0.015625 | 0.046875 | 0.062500 | 0.046875 | 0.109375 | 0.000000 | 0.140625 | 0.125000 | 0.171875 |
| 3277 | 09000 | Connecticut | CT | 0.424576 | 0.557866 | 1623542.0 | 884432.0 | 668266.0 | 9.369879 | 26.854712 | ... | 0.125000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3278 | 10000 | Delaware | DE | 0.443037 | 0.542737 | 441535.0 | 235581.0 | 185103.0 | 9.982669 | 31.292805 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3279 | 11000 | District of Columbia | DC | 0.053973 | 0.921497 | 280272.0 | 260223.0 | 11553.0 | 9.076816 | 16.835115 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3280 | 12000 | Florida | FL | 0.633620 | 0.357409 | 9386750.0 | 4485745.0 | 4605515.0 | 11.810859 | 28.573500 | ... | 0.089552 | 0.044776 | 0.014925 | 0.149254 | 0.029851 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014925 |
| 3281 | 13000 | Georgia | GA | 0.639809 | 0.350515 | 4029564.0 | 1837300.0 | 2068623.0 | 12.855142 | 27.714716 | ... | 0.037736 | 0.044025 | 0.075472 | 0.157233 | 0.050314 | 0.062893 | 0.056604 | 0.012579 | 0.012579 | 0.025157 |
| 3282 | 15000 | Hawaii | HI | 0.330023 | 0.648358 | 428825.0 | 266827.0 | 128815.0 | 8.028257 | 27.356155 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.400000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3283 | 16000 | Idaho | ID | 0.730509 | 0.240739 | 688235.0 | 189677.0 | 407199.0 | 9.226700 | 27.356852 | ... | 0.000000 | 0.000000 | 0.159091 | 0.136364 | 0.113636 | 0.181818 | 0.045455 | 0.000000 | 0.045455 | 0.045455 |
| 3284 | 17000 | Illinois | IL | 0.652193 | 0.327196 | 5374280.0 | 2977498.0 | 2118179.0 | 10.787586 | 25.954943 | ... | 0.049020 | 0.058824 | 0.078431 | 0.147059 | 0.029412 | 0.107843 | 0.088235 | 0.009804 | 0.019608 | 0.019608 |
| 3285 | 18000 | Indiana | IN | 0.688717 | 0.291546 | 2722029.0 | 1031953.0 | 1556220.0 | 11.181375 | 33.406757 | ... | 0.076087 | 0.108696 | 0.141304 | 0.097826 | 0.021739 | 0.054348 | 0.021739 | 0.000000 | 0.000000 | 0.000000 |
| 3286 | 19000 | Iowa | IA | 0.638318 | 0.344197 | 1542880.0 | 650790.0 | 798923.0 | 7.908792 | 30.982542 | ... | 0.000000 | 0.000000 | 0.050505 | 0.262626 | 0.070707 | 0.121212 | 0.141414 | 0.050505 | 0.050505 | 0.040404 |
| 3287 | 20000 | Kansas | KS | 0.752552 | 0.227409 | 1147143.0 | 414788.0 | 656009.0 | 9.048409 | 25.906137 | ... | 0.019048 | 0.019048 | 0.047619 | 0.085714 | 0.028571 | 0.104762 | 0.095238 | 0.142857 | 0.066667 | 0.209524 |
| 3288 | 21000 | Kentucky | KY | 0.740423 | 0.245121 | 1922218.0 | 628834.0 | 1202942.0 | 13.738696 | 32.893917 | ... | 0.033333 | 0.041667 | 0.033333 | 0.091667 | 0.066667 | 0.150000 | 0.083333 | 0.091667 | 0.041667 | 0.075000 |
| 3289 | 22000 | Louisiana | LA | 0.646492 | 0.339012 | 2027731.0 | 779535.0 | 1178004.0 | 14.773869 | 33.962753 | ... | 0.015625 | 0.015625 | 0.093750 | 0.171875 | 0.031250 | 0.031250 | 0.015625 | 0.046875 | 0.031250 | 0.000000 |
| 3290 | 23000 | Maine | ME | 0.486435 | 0.485294 | 741550.0 | 354873.0 | 334838.0 | 7.389770 | 31.473530 | ... | 0.000000 | 0.000000 | 0.062500 | 0.375000 | 0.062500 | 0.000000 | 0.000000 | 0.000000 | 0.187500 | 0.000000 |
| 3291 | 24000 | Maryland | MD | 0.477075 | 0.498362 | 2474543.0 | 1497951.0 | 873646.0 | 9.796140 | 24.611042 | ... | 0.041667 | 0.083333 | 0.041667 | 0.000000 | 0.041667 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3292 | 25000 | Massachusetts | MA | 0.308799 | 0.668589 | 3231531.0 | 1964768.0 | 1083069.0 | 9.242436 | 24.019262 | ... | 0.000000 | 0.000000 | 0.071429 | 0.000000 | 0.000000 | 0.071429 | 0.000000 | 0.000000 | 0.071429 | 0.000000 |
| 3293 | 26000 | Michigan | MI | 0.596681 | 0.387700 | 4789450.0 | 2267373.0 | 2279210.0 | 9.190457 | 28.873878 | ... | 0.012048 | 0.024096 | 0.108434 | 0.036145 | 0.024096 | 0.180723 | 0.048193 | 0.120482 | 0.108434 | 0.024096 |
| 3294 | 27000 | Minnesota | MN | 0.602136 | 0.376106 | 2916404.0 | 1366676.0 | 1322891.0 | 6.859513 | 24.647589 | ... | 0.045977 | 0.080460 | 0.080460 | 0.126437 | 0.057471 | 0.068966 | 0.091954 | 0.034483 | 0.034483 | 0.068966 |
| 3295 | 28000 | Mississippi | MS | 0.562834 | 0.423827 | 1162987.0 | 462001.0 | 678457.0 | 15.493731 | 30.438028 | ... | 0.024390 | 0.048780 | 0.073171 | 0.109756 | 0.109756 | 0.219512 | 0.146341 | 0.060976 | 0.000000 | 0.000000 |
| 3296 | 29000 | Missouri | MO | 0.752084 | 0.232851 | 2775098.0 | 1054889.0 | 1585753.0 | 10.078580 | 30.617037 | ... | 0.034783 | 0.095652 | 0.060870 | 0.113043 | 0.052174 | 0.095652 | 0.060870 | 0.139130 | 0.034783 | 0.017391 |
| 3297 | 30000 | Montana | MT | 0.689422 | 0.287197 | 483574.0 | 174521.0 | 274120.0 | 6.449651 | 28.832081 | ... | 0.000000 | 0.000000 | 0.000000 | 0.071429 | 0.160714 | 0.089286 | 0.089286 | 0.053571 | 0.160714 | 0.285714 |
| 3298 | 31000 | Nebraska | NE | 0.780742 | 0.198169 | 805638.0 | 273858.0 | 485819.0 | 8.595745 | 26.106092 | ... | 0.000000 | 0.000000 | 0.043011 | 0.053763 | 0.096774 | 0.139785 | 0.064516 | 0.172043 | 0.086022 | 0.204301 |
| 3299 | 32000 | Nevada | NV | 0.696978 | 0.277140 | 1122990.0 | 537753.0 | 511319.0 | 13.309174 | 28.085070 | ... | 0.117647 | 0.058824 | 0.117647 | 0.000000 | 0.000000 | 0.176471 | 0.058824 | 0.117647 | 0.117647 | 0.000000 |
| 3300 | 33000 | New Hampshire | NH | 0.458564 | 0.523999 | 730628.0 | 348126.0 | 345598.0 | 6.894038 | 27.419645 | ... | 0.200000 | 0.100000 | 0.100000 | 0.000000 | 0.000000 | 0.300000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3301 | 34000 | New Jersey | NJ | 0.437915 | 0.544962 | 3674893.0 | 2021756.0 | 1535513.0 | 10.183736 | 27.185795 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3302 | 35000 | New Mexico | NM | 0.532221 | 0.447902 | 783127.0 | 380724.0 | 315875.0 | 14.411727 | 26.460430 | ... | 0.000000 | 0.000000 | 0.181818 | 0.060606 | 0.000000 | 0.242424 | 0.090909 | 0.060606 | 0.090909 | 0.060606 |
| 3303 | 36000 | New York | NY | 0.508439 | 0.471869 | 7046175.0 | 4143874.0 | 2640570.0 | 13.179301 | 25.977776 | ... | 0.096774 | 0.032258 | 0.080645 | 0.080645 | 0.032258 | 0.048387 | 0.016129 | 0.000000 | 0.000000 | 0.000000 |
| 3304 | 37000 | North Carolina | NC | 0.584579 | 0.403189 | 4629471.0 | 2162074.0 | 2339603.0 | 12.219548 | 25.652466 | ... | 0.110000 | 0.050000 | 0.130000 | 0.050000 | 0.060000 | 0.040000 | 0.020000 | 0.040000 | 0.010000 | 0.030000 |
| 3305 | 38000 | North Dakota | ND | 0.725569 | 0.247834 | 336968.0 | 93526.0 | 216133.0 | 7.351314 | 26.429096 | ... | 0.000000 | 0.000000 | 0.018868 | 0.037736 | 0.150943 | 0.113208 | 0.018868 | 0.245283 | 0.037736 | 0.264151 |
| 3306 | 39000 | Ohio | OH | 0.674596 | 0.310536 | 5325395.0 | 2317001.0 | 2771984.0 | 9.621357 | 33.037495 | ... | 0.193182 | 0.056818 | 0.159091 | 0.079545 | 0.011364 | 0.022727 | 0.034091 | 0.011364 | 0.000000 | 0.000000 |
| 3307 | 40000 | Oklahoma | OK | 0.778398 | 0.202755 | 1451056.0 | 419788.0 | 947934.0 | 11.976947 | 31.330032 | ... | 0.038961 | 0.077922 | 0.064935 | 0.129870 | 0.012987 | 0.129870 | 0.155844 | 0.129870 | 0.012987 | 0.012987 |
| 3308 | 41000 | Oregon | OR | 0.566156 | 0.404151 | 1808575.0 | 934631.0 | 742506.0 | 9.287846 | 22.735300 | ... | 0.083333 | 0.027778 | 0.138889 | 0.027778 | 0.000000 | 0.138889 | 0.000000 | 0.083333 | 0.055556 | 0.083333 |
| 3309 | 42000 | Pennsylvania | PA | 0.635927 | 0.350978 | 5970107.0 | 2844705.0 | 2912941.0 | 9.480545 | 34.693886 | ... | 0.059701 | 0.044776 | 0.149254 | 0.029851 | 0.059701 | 0.029851 | 0.059701 | 0.014925 | 0.000000 | 0.000000 |
| 3310 | 44000 | Rhode Island | RI | 0.380590 | 0.598526 | 450121.0 | 249902.0 | 179421.0 | 11.186517 | 28.267363 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 3311 | 45000 | South Carolina | SC | 0.535671 | 0.452410 | 2084444.0 | 849469.0 | 1143611.0 | 12.488965 | 29.103146 | ... | 0.000000 | 0.021739 | 0.173913 | 0.217391 | 0.000000 | 0.000000 | 0.021739 | 0.000000 | 0.000000 | 0.000000 |
| 3312 | 46000 | South Dakota | SD | 0.673640 | 0.305517 | 370047.0 | 117442.0 | 227701.0 | 8.253245 | 30.237791 | ... | 0.000000 | 0.000000 | 0.045455 | 0.060606 | 0.106061 | 0.151515 | 0.015152 | 0.303030 | 0.030303 | 0.166667 |
| 3313 | 47000 | Tennessee | TN | 0.747807 | 0.237236 | 2484691.0 | 867110.0 | 1517402.0 | 12.537143 | 32.088009 | ... | 0.073684 | 0.094737 | 0.073684 | 0.115789 | 0.052632 | 0.063158 | 0.031579 | 0.042105 | 0.000000 | 0.010526 |
| 3314 | 48000 | Texas | TX | 0.743895 | 0.245202 | 8903237.0 | 3867816.0 | 4681590.0 | 16.313875 | 24.957039 | ... | 0.047244 | 0.078740 | 0.074803 | 0.133858 | 0.051181 | 0.059055 | 0.070866 | 0.047244 | 0.051181 | 0.062992 |
| 3315 | 49000 | Utah | UT | 0.728772 | 0.241405 | 852461.0 | 237241.0 | 397004.0 | 7.719078 | 22.836246 | ... | 0.068966 | 0.000000 | 0.034483 | 0.068966 | 0.034483 | 0.068966 | 0.103448 | 0.068966 | 0.103448 | 0.103448 |
| 3316 | 50000 | Vermont | VT | 0.351706 | 0.615269 | 291413.0 | 178179.0 | 95053.0 | 7.327794 | 28.795351 | ... | 0.000000 | 0.000000 | 0.214286 | 0.071429 | 0.071429 | 0.214286 | 0.142857 | 0.000000 | 0.071429 | 0.000000 |
| 3317 | 51000 | Virginia | VA | 0.551998 | 0.431626 | 3844787.0 | 1916845.0 | 1731156.0 | 10.305691 | 23.953545 | ... | 0.000000 | 0.150376 | 0.030075 | 0.060150 | 0.045113 | 0.030075 | 0.030075 | 0.007519 | 0.007519 | 0.037594 |
| 3318 | 53000 | Washington | WA | 0.520402 | 0.448434 | 2765627.0 | 1523720.0 | 1043648.0 | 8.672709 | 21.999466 | ... | 0.102564 | 0.051282 | 0.076923 | 0.051282 | 0.051282 | 0.051282 | 0.025641 | 0.000000 | 0.000000 | 0.051282 |
| 3319 | 54000 | West Virginia | WV | 0.741402 | 0.243346 | 708226.0 | 187457.0 | 486198.0 | 13.097753 | 40.320034 | ... | 0.000000 | 0.000000 | 0.090909 | 0.127273 | 0.181818 | 0.054545 | 0.072727 | 0.054545 | 0.000000 | 0.036364 |
| 3320 | 55000 | Wisconsin | WI | 0.564259 | 0.419635 | 2944620.0 | 1382210.0 | 1409467.0 | 7.791683 | 30.638811 | ... | 0.055556 | 0.041667 | 0.125000 | 0.236111 | 0.027778 | 0.013889 | 0.041667 | 0.000000 | 0.027778 | 0.069444 |
| 3321 | 56000 | Wyoming | WY | 0.750912 | 0.217684 | 248742.0 | 55949.0 | 174248.0 | 6.834035 | 29.073072 | ... | 0.000000 | 0.000000 | 0.043478 | 0.043478 | 0.000000 | 0.260870 | 0.260870 | 0.086957 | 0.130435 | 0.086957 |
51 rows × 38 columns
# Save the cleaned state-level file
cleaned_state_df.to_excel('checkpoints/states_df.xlsx', index=False)
print("\nFile 'checkpoints/states_df.xlsx' saved with all data integrated.")
File 'checkpoints/states_df.xlsx' saved with all data integrated.
Our DataFrame is now clean and ready to be used for modeling.
4. EXPLORATORY ANALYSIS¶
4.1. Loading, Inspecting and Creating the Target Variable¶
df = pd.read_excel("checkpoints/states_df.xlsx")
df["target"] = (df["per_gop"] > df["per_dem"]).astype(int)  # 1 if the GOP won the state, 0 otherwise
df = df.drop(columns=["is_state"], errors="ignore")
df.head()
| id | state_name | state_code | per_gop | per_dem | total_2016 | dem_2016 | gop_2016 | percent_no_highschool | percent_highschool | ... | uic_4 | uic_5 | uic_6 | uic_7 | uic_8 | uic_9 | uic_10 | uic_11 | uic_12 | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | Alabama | AL | 0.647359 | 0.342648 | 2078165 | 718084 | 1306925 | 13.819302 | 30.800268 | ... | 0.059701 | 0.104478 | 0.208955 | 0.029851 | 0.000000 | 0.000000 | 0.044776 | 0.074627 | 0.000000 | 1 |
| 1 | 2000 | Alaska | AK | 0.497797 | 0.420912 | 0 | 0 | 0 | 7.152934 | 28.003729 | ... | 0.000000 | 0.000000 | 0.000000 | 0.034483 | 0.068966 | 0.000000 | 0.068966 | 0.344828 | 0.379310 | 1 |
| 2 | 4000 | Arizona | AZ | 0.548723 | 0.435861 | 2062810 | 936250 | 1021154 | 12.860705 | 23.858877 | ... | 0.066667 | 0.133333 | 0.066667 | 0.000000 | 0.066667 | 0.066667 | 0.000000 | 0.000000 | 0.000000 | 1 |
| 3 | 5000 | Arkansas | AR | 0.688531 | 0.282032 | 1108615 | 378729 | 677904 | 13.430243 | 34.034885 | ... | 0.013333 | 0.053333 | 0.213333 | 0.040000 | 0.133333 | 0.146667 | 0.066667 | 0.026667 | 0.000000 | 1 |
| 4 | 6000 | California | CA | 0.439389 | 0.537068 | 9631972 | 5931283 | 3184721 | 16.692171 | 20.487896 | ... | 0.051724 | 0.068966 | 0.086207 | 0.034483 | 0.051724 | 0.000000 | 0.000000 | 0.034483 | 0.017241 | 0 |
5 rows × 39 columns
df.shape
(51, 39)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       51 non-null     int64
 1   state_name               51 non-null     object
 2   state_code               51 non-null     object
 3   per_gop                  51 non-null     float64
 4   per_dem                  51 non-null     float64
 5   total_2016               51 non-null     int64
 6   dem_2016                 51 non-null     int64
 7   gop_2016                 51 non-null     int64
 8   percent_no_highschool    51 non-null     float64
 9   percent_highschool       51 non-null     float64
 10  percent_college          51 non-null     float64
 11  percent_bachelor         51 non-null     float64
 12  percent_poverty          51 non-null     float64
 13  median_household_income  51 non-null     int64
 14  unemployment_rate        51 non-null     float64
 15  Employed_2019            51 non-null     int64
 16  Unemployed_2019          51 non-null     int64
 17  ruc_1                    51 non-null     float64
 18  ruc_2                    51 non-null     float64
 19  ruc_3                    51 non-null     float64
 20  ruc_4                    51 non-null     float64
 21  ruc_5                    51 non-null     float64
 22  ruc_6                    51 non-null     float64
 23  ruc_7                    51 non-null     float64
 24  ruc_8                    51 non-null     float64
 25  ruc_9                    51 non-null     float64
 26  uic_1                    51 non-null     float64
 27  uic_2                    51 non-null     float64
 28  uic_3                    51 non-null     float64
 29  uic_4                    51 non-null     float64
 30  uic_5                    51 non-null     float64
 31  uic_6                    51 non-null     float64
 32  uic_7                    51 non-null     float64
 33  uic_8                    51 non-null     float64
 34  uic_9                    51 non-null     float64
 35  uic_10                   51 non-null     float64
 36  uic_11                   51 non-null     float64
 37  uic_12                   51 non-null     float64
 38  target                   51 non-null     int64
dtypes: float64(29), int64(8), object(2)
memory usage: 15.7+ KB
df.describe()
| id | per_gop | per_dem | total_2016 | dem_2016 | gop_2016 | percent_no_highschool | percent_highschool | percent_college | percent_bachelor | ... | uic_4 | uic_5 | uic_6 | uic_7 | uic_8 | uic_9 | uic_10 | uic_11 | uic_12 | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 51.000000 | 51.000000 | 51.000000 | 5.100000e+01 | 5.100000e+01 | 5.100000e+01 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | ... | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 | 51.000000 |
| mean | 28960.784314 | 0.586317 | 0.392727 | 2.495477e+06 | 1.193607e+06 | 1.180349e+06 | 10.461532 | 28.010589 | 29.757235 | 31.770644 | ... | 0.034489 | 0.073865 | 0.088338 | 0.042785 | 0.092530 | 0.049669 | 0.050380 | 0.045169 | 0.052457 | 0.784314 |
| std | 15832.827649 | 0.146841 | 0.145555 | 2.386844e+06 | 1.278957e+06 | 1.076152e+06 | 2.706207 | 4.173287 | 4.023961 | 6.427930 | ... | 0.037211 | 0.054218 | 0.079990 | 0.043547 | 0.087541 | 0.055531 | 0.066516 | 0.063557 | 0.084690 | 0.415390 |
| min | 1000.000000 | 0.053973 | 0.198169 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 6.449651 | 16.835115 | 15.547361 | 20.614605 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 16500.000000 | 0.503118 | 0.279586 | 7.360890e+05 | 2.703425e+05 | 3.713010e+05 | 8.253462 | 25.779302 | 27.592871 | 27.843889 | ... | 0.000000 | 0.038075 | 0.032998 | 0.000000 | 0.029963 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 50% | 29000.000000 | 0.602136 | 0.376106 | 1.922218e+06 | 7.795350e+05 | 9.479340e+05 | 9.796140 | 28.003729 | 29.736067 | 31.255686 | ... | 0.024096 | 0.071429 | 0.071429 | 0.034483 | 0.068966 | 0.031579 | 0.034483 | 0.027778 | 0.017241 | 1.000000 |
| 75% | 41500.000000 | 0.693200 | 0.462139 | 3.088076e+06 | 1.680510e+06 | 1.545866e+06 | 12.696142 | 30.719540 | 32.605509 | 34.425982 | ... | 0.057821 | 0.106456 | 0.128571 | 0.058586 | 0.136111 | 0.078030 | 0.068966 | 0.069048 | 0.065979 | 1.000000 |
| max | 56000.000000 | 0.780742 | 0.921497 | 9.631972e+06 | 5.931283e+06 | 4.681590e+06 | 16.692171 | 40.320034 | 36.730377 | 58.540707 | ... | 0.150376 | 0.214286 | 0.375000 | 0.181818 | 0.400000 | 0.260870 | 0.303030 | 0.344828 | 0.379310 | 1.000000 |
8 rows × 37 columns
4.2. Distribution des votes pour les différents partis (Démocrates et Républicains)¶
plt.figure(figsize=(12, 6))
# Couleurs conventionnelles : rouge pour les Républicains, bleu pour les Démocrates
sns.histplot(df['per_gop'], bins=30, color='red', label='Votes Républicains', kde=True)
sns.histplot(df['per_dem'], bins=30, color='blue', label='Votes Démocrates', kde=True)
plt.title('Distribution des votes pour Démocrates et Républicains')
plt.xlabel('Pourcentage de votes')
plt.ylabel('Fréquence')
plt.legend()
plt.show()
Sur ce graphique, on observe la distribution des pourcentages de vote par état pour les Républicains (rouge) et les Démocrates (bleu).
- Les états à tendance démocrate affichent généralement des pourcentages entre 20% et 50%, avec un pic autour de 30%, tandis que les états républicains montrent des pourcentages plus élevés, entre 50% et 80%, avec une concentration maximale vers 70-75%.
On remarque donc qu'il y a peu d'états avec un équilibre proche de 50%, ce qui suggère une division politique assez nette entre les états américains.
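Pour appuyer cette lecture, on peut compter les états réellement « disputés ». Esquisse indicative, avec un seuil arbitraire de 5 points d'écart :
# États où l'écart GOP-DEM est inférieur à 5 points de pourcentage
close_states = df[(df['per_gop'] - df['per_dem']).abs() < 0.05]
print(f"{len(close_states)} état(s) disputé(s) :")
print(close_states[['state_name', 'per_gop', 'per_dem']])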
Via un boxplot, essayons de voir s'il existe une différence significative dans les niveaux de chômage en fonction du parti qui a remporté l'état.
plt.figure(figsize=(10, 6))
sns.boxplot(x='target', y='unemployment_rate', data=df)
plt.title('Boxplot de taux de chômage par parti politique')
plt.xlabel('Target Variable (0 = Démocrates, 1 = Républicains)')
plt.ylabel('Taux de chômage')
plt.show()
Ce boxplot compare les taux de chômage entre les états à majorité démocrate (0) et républicaine (1).
- Les états républicains présentent une médiane légèrement inférieure mais une dispersion plus importante, avec des valeurs extrêmes plus élevées (jusqu'à 6%).
- Les états démocrates montrent une distribution plus resserrée avec une médiane autour de 3.6%.
Bien que les deux groupes aient des taux de chômage globalement similaires, les républicains présentent à la fois des états avec les taux les plus bas et les plus élevés, suggérant une plus grande variabilité économique entre ces états.
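Ces observations peuvent être chiffrées directement, par exemple via un groupby (esquisse indicative) :
# Quartiles et dispersion du taux de chômage par parti (0 = Démocrates, 1 = Républicains)
stats_chomage = df.groupby('target')['unemployment_rate'].describe()[['25%', '50%', '75%', 'std']]
print(stats_chomage)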
4.3. Analyses Univariées¶
colors = {'dem': '#3333FF', 'gop': '#FF3333'}
# Création d'une figure pour regrouper les analyses univariées
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
# 1. Variables politiques - Histogrammes
ax1 = fig.add_subplot(gs[0, 0])
sns.histplot(df['per_gop'], bins=30, color=colors['gop'], kde=True, ax=ax1)
ax1.set_title('Distribution des votes Républicains par état', fontsize=14)
ax1.set_xlabel('Pourcentage de votes Républicains (%)')
ax1.set_ylabel('Nombre d\'états')
ax2 = fig.add_subplot(gs[0, 1])
sns.histplot(df['per_dem'], bins=30, color=colors['dem'], kde=True, ax=ax2)
ax2.set_title('Distribution des votes Démocrates par état', fontsize=14)
ax2.set_xlabel('Pourcentage de votes Démocrates (%)')
ax2.set_ylabel('Nombre d\'états')
# 2. Variables éducatives - Histogrammes
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax3 = fig.add_subplot(gs[1, 0])
education_vars = ['percent_no_highschool', 'percent_highschool',
'percent_college', 'percent_bachelor']
education_labels = ['Sans diplôme', 'Diplôme secondaire',
'Études supérieures', 'Licence ou plus']
colors_edu = ['#FF9999', '#99FF99', '#9999FF', '#FFFF99']
for i, (var, label, color) in enumerate(zip(education_vars, education_labels, colors_edu)):
sns.histplot(df[var], bins=20, kde=True, color=color, alpha=0.7,
label=label, ax=ax3)
ax3.set_title('Distribution des niveaux d\'éducation par état', fontsize=14)
ax3.set_xlabel('Pourcentage de la population (%)')
ax3.set_ylabel('Nombre d\'états')
ax3.legend()
# 3. Variables économiques - Histogrammes
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax4 = fig.add_subplot(gs[1, 1])
sns.histplot(df['unemployment_rate'], bins=20, kde=True, color='#66CCFF', ax=ax4)
ax4.set_title('Distribution du taux de chômage par état', fontsize=14)
ax4.set_xlabel('Taux de chômage (%)')
ax4.set_ylabel('Nombre d\'états')
ax5 = fig.add_subplot(gs[2, 0])
sns.histplot(df['percent_poverty'], bins=20, kde=True, color='#FF6666', ax=ax5)
ax5.set_title('Distribution du taux de pauvreté par état', fontsize=14)
ax5.set_xlabel('Taux de pauvreté (%)')
ax5.set_ylabel('Nombre d\'états')
ax6 = fig.add_subplot(gs[2, 1])
sns.histplot(df['median_household_income'], bins=20, kde=True, color='#66CC66', ax=ax6)
ax6.set_title('Distribution du revenu médian des ménages par état', fontsize=14)
ax6.set_xlabel('Revenu médian ($)')
ax6.set_ylabel('Nombre d\'états')
# 4. Variables géographiques - Distribution des codes RUCC (Rural-Urban Continuum Code)
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax7 = fig.add_subplot(gs[3, 0])
ruc_columns = [col for col in df.columns if col.startswith('ruc_')]
# On ne trie pas les moyennes : l'ordre ruc_1..ruc_9 doit correspondre aux libellés
ruc_means = df[ruc_columns].mean()
ruc_categories = ['Métropolitain, >1M', 'Métropolitain, 250k-1M', 'Métropolitain, <250k',
'Urbain, >20k, adj. métro', 'Urbain, >20k, non-adj. métro',
'Urbain, 2.5k-20k, adj. métro', 'Urbain, 2.5k-20k, non-adj. métro',
'Rural, adj. métro', 'Rural, non-adj. métro']
colors_ruc = plt.cm.Spectral(np.linspace(0, 1, len(ruc_means)))
ax7.pie(ruc_means, labels=ruc_categories,
autopct='%1.1f%%', startangle=90, colors=colors_ruc)
ax7.set_title('Distribution des codes Rural-Urban Continuum (RUC)', fontsize=14)
# 5. Variables géographiques - Distribution des codes UIC (Urban Influence Codes)
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax8 = fig.add_subplot(gs[3, 1])
uic_columns = [col for col in df.columns if col.startswith('uic_')]
# Même logique que pour les RUC : pas de tri, afin que "UIC i" corresponde à la colonne uic_i
uic_means = df[uic_columns].mean()
colors_uic = plt.cm.tab20(np.linspace(0, 1, len(uic_means)))
ax8.pie(uic_means, labels=[f"UIC {i+1}" for i in range(len(uic_means))],
autopct='%1.1f%%', startangle=90, colors=colors_uic)
ax8.set_title('Distribution des codes Urban Influence (UIC)', fontsize=14)
4.4. Analyses Bivariées¶
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax9 = fig.add_subplot(gs[4, 0])
# Boxplot coloré par parti via `hue` (palette par défaut de seaborn)
sns.boxplot(
x='target',
y='unemployment_rate',
data=df,
hue='target',
legend=False,
ax=ax9
)
ax9.set_title('Taux de chômage par affiliation politique', fontsize=14)
ax9.set_xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
ax9.set_ylabel('Taux de chômage (%)')
plt.show()
# 2. Boxplots des variables d'éducation par parti
# (figure dédiée : la copie du canvas via `canvas.renderer._renderer` utilisée
# initialement est fragile et inutile ici)
education_fig, education_axes = plt.subplots(2, 2, figsize=(15, 12))
education_axes = education_axes.flatten()
for i, (var, label) in enumerate(zip(education_vars, education_labels)):
    sns.boxplot(x='target', y=var, data=df, hue='target', ax=education_axes[i])
    education_axes[i].set_title(f'{label} par affiliation politique', fontsize=12)
    education_axes[i].set_xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
    education_axes[i].set_ylabel(f'Pourcentage de {label.lower()} (%)')
education_fig.tight_layout()
plt.show()
# 3. Boxplots des variables économiques par parti
fig = plt.figure(figsize=(20, 30))
gs = gridspec.GridSpec(6, 2, figure=fig)
ax10 = fig.add_subplot(gs[5, 0])
sns.boxplot(x='target', y='percent_poverty', data=df, hue='target',
palette={0: colors['dem'], 1: colors['gop']}, ax=ax10)
ax10.set_title('Taux de pauvreté par affiliation politique', fontsize=14)
ax10.set_xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
ax10.set_ylabel('Taux de pauvreté (%)')
ax11 = fig.add_subplot(gs[5, 1])
sns.boxplot(x='target', y='median_household_income',hue='target',palette={0: colors['dem'], 1: colors['gop']}, data=df, ax=ax11)
ax11.set_title('Revenu médian des ménages par affiliation politique', fontsize=14)
ax11.set_xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
ax11.set_ylabel('Revenu médian ($)')
fig.tight_layout(pad=3.0)
plt.show()
df.columns
Index(['id', 'state_name', 'state_code', 'per_gop', 'per_dem', 'total_2016',
'dem_2016', 'gop_2016', 'percent_no_highschool', 'percent_highschool',
'percent_college', 'percent_bachelor', 'percent_poverty',
'median_household_income', 'unemployment_rate', 'Employed_2019',
'Unemployed_2019', 'ruc_1', 'ruc_2', 'ruc_3', 'ruc_4', 'ruc_5', 'ruc_6',
'ruc_7', 'ruc_8', 'ruc_9', 'uic_1', 'uic_2', 'uic_3', 'uic_4', 'uic_5',
'uic_6', 'uic_7', 'uic_8', 'uic_9', 'uic_10', 'uic_11', 'uic_12',
'target'],
dtype='object')
# 4. Matrices de corrélation (avec et sans variables socio-économiques)
socio_eco_vars = ['percent_no_highschool', 'percent_highschool',
'percent_college', 'percent_bachelor', 'percent_poverty',
'median_household_income', 'unemployment_rate']
df_corr = df[socio_eco_vars + ['target']]
# Calcul de la corrélation de Pearson
corr_matrix = df_corr.corr()
# Visualisation de la matrice de corrélation sous forme de heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Matrice de Corrélation entre les Variables (socio-économiques) et la Cible')
plt.show()
plt.figure(figsize=(16, 12))
socio_eco_vars = ['percent_no_highschool', 'percent_highschool',
'percent_college', 'percent_bachelor', 'percent_poverty',
'median_household_income', 'unemployment_rate']
corr_matrix = df[socio_eco_vars].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0,
square=True, linewidths=.5, annot=True, fmt=".2f")
plt.title('Matrice de corrélation des variables socio-économiques et politiques', fontsize=16)
plt.show()
# 5. Scatter plots des variables les plus corrélées avec l'affiliation politique
plt.figure(figsize=(18, 10))
potential_predictors = ['percent_no_highschool', 'percent_bachelor',
'percent_poverty', 'median_household_income']
for i, var in enumerate(potential_predictors):
plt.subplot(2, 2, i+1)
sns.scatterplot(x=var, y='per_gop', data=df, color=colors['gop'],
alpha=0.7, label='Républicains')
sns.scatterplot(x=var, y='per_dem', data=df, color=colors['dem'],
alpha=0.7, label='Démocrates')
sns.regplot(x=var, y='per_gop', data=df, color=colors['gop'],
scatter=False, ci=None, line_kws={"linestyle": "--"})
sns.regplot(x=var, y='per_dem', data=df, color=colors['dem'],
scatter=False, ci=None, line_kws={"linestyle": "--"})
plt.title(f'Relation entre {var} et le pourcentage de votes', fontsize=12)
plt.xlabel(var)
plt.ylabel('Pourcentage de votes')
plt.legend()
plt.tight_layout()
plt.show()
# 6. Analyse de la composition rurale/urbaine par parti politique (Regroupement des codes RUC (Rural Urban Codes))
# RUC 1-3 : zones métropolitaines
# RUC 4-7 : zones urbaines
# RUC 8-9 : zones rurales
# Créer des colonnes agrégées pour l'urbanité
df['urban_pct'] = df[['ruc_1', 'ruc_2', 'ruc_3']].sum(axis=1) # Zones métropolitaines
df['semi_urban_pct'] = df[['ruc_4', 'ruc_5', 'ruc_6', 'ruc_7']].sum(axis=1) # Zones urbaines
df['rural_pct'] = df[['ruc_8', 'ruc_9']].sum(axis=1) # Zones rurales
plt.figure(figsize=(14, 7))
rural_vars = ['urban_pct', 'semi_urban_pct', 'rural_pct']
rural_labels = ['Métropolitain', 'Urbain', 'Rural']
for i, (var, label) in enumerate(zip(rural_vars, rural_labels)):
plt.subplot(1, 3, i+1)
sns.boxplot(x='target', y=var, data=df, hue='target', palette={0: colors['dem'], 1: colors['gop']})
plt.title(f'Distribution {label} (RUC) par affiliation politique', fontsize=12)
plt.xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
plt.ylabel(f'Pourcentage {label} (%)')
plt.tight_layout()
plt.show()
# 7. Regroupement des codes UIC (Urban Influence Codes)
# UIC 1-2: Grands comtés métropolitains
# UIC 3-7: Comtés métropolitains de petite taille ou sous influence métropolitaine
# UIC 8-12: Comtés non-métropolitains ou ruraux
# Création des variables agrégées pour les UIC
df['large_metro_uic'] = df[['uic_1', 'uic_2']].sum(axis=1) # Grands comtés métropolitains
df['small_metro_uic'] = df[['uic_3', 'uic_4', 'uic_5', 'uic_6', 'uic_7']].sum(axis=1) # Petits comtés métropolitains ou sous influence
df['rural_uic'] = df[['uic_8', 'uic_9', 'uic_10', 'uic_11', 'uic_12']].sum(axis=1) # Comtés ruraux
# Visualisation similaire à celle réalisée pour les RUC
plt.figure(figsize=(14, 7))
uic_vars = ['large_metro_uic', 'small_metro_uic', 'rural_uic']
uic_labels = ['Grands Métropolitains', 'Petits Métropolitains', 'Ruraux']
for i, (var, label) in enumerate(zip(uic_vars, uic_labels)):
plt.subplot(1, 3, i+1)
sns.boxplot(x='target', y=var, data=df, hue='target', palette={0: colors['dem'], 1: colors['gop']})
plt.title(f'{label} (UIC) par affiliation politique', fontsize=12)
plt.xlabel('Affiliation politique (0 = Démocrates, 1 = Républicains)')
plt.ylabel(f'Pourcentage {label} (%)')
plt.tight_layout()
plt.show()
# Ajoutons également ces variables agrégées à l'analyse de corrélation
# pour voir leur relation avec la variable cible
all_vars = socio_eco_vars + rural_vars + uic_vars + ['target']
full_corr_matrix = df[all_vars].corr()
# Visualisation de la corrélation des nouvelles variables avec la cible
target_corr = full_corr_matrix['target'].sort_values(ascending=False)
print("Corrélation avec la variable cible:")
print(target_corr)
Corrélation avec la variable cible:
target                      1.000000
percent_college             0.571915
small_metro_uic             0.565129
semi_urban_pct              0.519181
rural_pct                   0.456753
percent_poverty             0.380079
rural_uic                   0.370999
percent_highschool          0.288327
percent_no_highschool       0.129814
unemployment_rate           0.090066
percent_bachelor           -0.599873
median_household_income    -0.659655
large_metro_uic            -0.663431
urban_pct                  -0.663431
Name: target, dtype: float64
# 8. Bar chart groupé de la composition rurale/urbaine par parti (RUC)
df_dem = df[df['target'] == 0]
df_rep = df[df['target'] == 1]
dem_rural_means = [df_dem['urban_pct'].mean(), df_dem['semi_urban_pct'].mean(), df_dem['rural_pct'].mean()]
rep_rural_means = [df_rep['urban_pct'].mean(), df_rep['semi_urban_pct'].mean(), df_rep['rural_pct'].mean()]
plt.figure(figsize=(10, 6))
width = 0.35
x = np.arange(len(rural_labels))
plt.bar(x - width/2, dem_rural_means, width, color=colors['dem'], alpha=0.7, label='Démocrates')
plt.bar(x + width/2, rep_rural_means, width, color=colors['gop'], alpha=0.7, label='Républicains')
plt.xlabel('Type de zone')
plt.ylabel('Pourcentage moyen (%)')
plt.title('Composition métropolitaine/rurale moyenne par affiliation politique (RUC)')
plt.xticks(x, rural_labels)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# 9. Bar chart groupé de la composition urbaine par parti (UIC)
df_dem = df[df['target'] == 0]
df_rep = df[df['target'] == 1]
dem_urban_means = [df_dem['large_metro_uic'].mean(), df_dem['small_metro_uic'].mean(), df_dem['rural_uic'].mean()]
rep_urban_means = [df_rep['large_metro_uic'].mean(), df_rep['small_metro_uic'].mean(), df_rep['rural_uic'].mean()]
plt.figure(figsize=(10, 6))
width = 0.35
x = np.arange(len(uic_labels))
plt.bar(x - width/2, dem_urban_means, width, color=colors['dem'], alpha=0.7, label='Démocrates')
plt.bar(x + width/2, rep_urban_means, width, color=colors['gop'], alpha=0.7, label='Républicains')
plt.xlabel('Type de zone')
plt.ylabel('Pourcentage moyen (%)')
plt.title('Composition urbaine/rurale moyenne par affiliation politique (UIC)')
plt.xticks(x, uic_labels)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# Visualisation de la matrice de corrélation sous forme de heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(full_corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Matrice de Corrélation entre les Variables et la Cible')
plt.show()
df['margin'] = (df['per_gop'] - df['per_dem'])*100
print(df)
# ---- Carte 1: Tendances électorales avec une échelle de couleur continue ----
fig1 = px.choropleth(
df,
locations='state_code',
locationmode='USA-states',
color='margin', # Utiliser la marge au lieu d'une valeur binaire
color_continuous_scale='RdBu_r', # Rouge pour GOP, Bleu pour DEM
range_color=[-30, 30], # Limiter l'échelle pour mieux voir les différences
scope='usa',
title='Tendances électorales par état (écart GOP-DEM)',
labels={'margin': 'Écart GOP-DEM (%)'}
)
fig1.update_layout(
geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
title_font_size=18,
margin=dict(t=50, b=0, l=0, r=0)
)
# ---- Carte 2: Pourcentage rural avec une échelle de couleur continue ----
fig2 = px.choropleth(
df,
locations='state_code',
locationmode='USA-states',
color='rural_pct',
color_continuous_scale='Greens',
scope='usa',
title='Pourcentage de zones rurales par état',
labels={'rural_pct': '% Rural'}
)
fig2.update_layout(
geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
title_font_size=18,
margin=dict(t=50, b=0, l=0, r=0)
)
# ---- Carte 3: Revenu médian des ménages ----
fig3 = px.choropleth(
df,
locations='state_code',
locationmode='USA-states',
color='median_household_income',
color_continuous_scale='Viridis',
scope='usa',
title='Revenu médian des ménages par état',
labels={'median_household_income': 'Revenu médian ($)'}
)
fig3.update_layout(
geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
title_font_size=18,
margin=dict(t=50, b=0, l=0, r=0)
)
# ---- Carte 4: Taux de chômage ----
fig4 = px.choropleth(
df,
locations='state_code',
locationmode='USA-states',
color='unemployment_rate',
color_continuous_scale='Reds',
scope='usa',
title='Taux de chômage par état',
labels={'unemployment_rate': 'Taux de chômage (%)'}
)
fig4.update_layout(
geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
title_font_size=18,
margin=dict(t=50, b=0, l=0, r=0)
)
# ---- Carte 5: Niveau d'éducation (pourcentage avec un diplôme universitaire) ----
fig5 = px.choropleth(
df,
locations='state_code',
locationmode='USA-states',
color='percent_bachelor',
color_continuous_scale='Blues',
scope='usa',
title="Pourcentage de la population avec un diplôme universitaire",
labels={'percent_bachelor': '% Diplôme universitaire'}
)
fig5.update_layout(
geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
title_font_size=18,
margin=dict(t=50, b=0, l=0, r=0)
)
# ---- Carte 6: Taux de pauvreté ----
fig6 = px.choropleth(
df,
locations='state_code',
locationmode='USA-states',
color='percent_poverty',
color_continuous_scale='OrRd',
scope='usa',
title="Taux de pauvreté par état",
labels={'percent_poverty': '% Pauvreté'}
)
fig6.update_layout(
geo=dict(showframe=False, showcoastlines=True, projection_scale=1.1),
title_font_size=18,
margin=dict(t=50, b=0, l=0, r=0)
)
# ---- Scatter plot: Relation entre ruralité et vote républicain ----
fig7 = px.scatter(
df,
x='rural_pct',
y='per_gop',
size='total_2016',
color='margin',
color_continuous_scale='RdBu_r',
hover_name='state_name',
title="Relation entre ruralité et vote républicain",
labels={
'rural_pct': '% Rural',
'per_gop': '% Vote républicain',
'total_2016': 'Total votes 2016',
'margin': 'Marge GOP-DEM'
},
size_max=40
)
fig7.update_layout(
title_font_size=18,
xaxis_title_font_size=14,
yaxis_title_font_size=14,
coloraxis_colorbar_title_font_size=14
)
# ---- Scatter plot: Relation entre éducation et vote républicain ----
fig8 = px.scatter(
df,
x='percent_bachelor',
y='per_gop',
size='total_2016',
color='median_household_income',
color_continuous_scale='Viridis',
hover_name='state_name',
title="Relation entre niveau d'éducation et vote républicain",
labels={
'percent_bachelor': '% Diplôme universitaire',
'per_gop': '% Vote républicain',
'total_2016': 'Total votes 2016',
'median_household_income': 'Revenu médian ($)'
},
size_max=40
)
fig8.update_layout(
title_font_size=18,
xaxis_title_font_size=14,
yaxis_title_font_size=14,
coloraxis_colorbar_title_font_size=14
)
# ---- Heatmap: Corrélation entre les variables ----
# Sélection des variables numériques pertinentes
cols_to_corr = ['per_gop', 'per_dem', 'margin', 'rural_pct', 'percent_no_highschool',
'percent_bachelor', 'percent_poverty', 'median_household_income',
'unemployment_rate']
# Calcul de la matrice de corrélation
corr_matrix = df[cols_to_corr].corr()
# Créer une heatmap de corrélation
fig9 = px.imshow(
corr_matrix,
text_auto='.2f',
color_continuous_scale='RdBu_r',
title="Matrice de corrélation entre les variables",
labels=dict(x="Variables", y="Variables", color="Corrélation")
)
fig9.update_layout(
title_font_size=18,
xaxis_title_font_size=14,
yaxis_title_font_size=14
)
# ---- Bar plot: Top 10 des états les plus républicains et démocrates ----
# On crée un dataframe pour les 10 états les plus républicains et les 10 plus démocrates
top_gop = df.sort_values('per_gop', ascending=False).head(10)
top_dem = df.sort_values('per_dem', ascending=False).head(10)
fig10 = make_subplots(rows=1, cols=2, subplot_titles=("Top 10 des états républicains", "Top 10 des états démocrates"))
fig10.add_trace(
go.Bar(
x=top_gop['state_name'],
y=top_gop['per_gop'],
marker_color='red',
name='% Républicain'
),
row=1, col=1
)
fig10.add_trace(
go.Bar(
x=top_dem['state_name'],
y=top_dem['per_dem'],
marker_color='blue',
name='% Démocrate'
),
row=1, col=2
)
fig10.update_layout(
title_text="Top 10 des états par affiliation politique",
title_font_size=18,
showlegend=True,
height=500
)
# Affichons tous les graphiques
for fig in [fig1, fig2, fig3, fig4, fig5, fig6, fig7, fig8, fig9, fig10]:
fig.show()
id state_name state_code per_gop per_dem total_2016 \
0 1000 Alabama AL 0.647359 0.342648 2078165
1 2000 Alaska AK 0.497797 0.420912 0
2 4000 Arizona AZ 0.548723 0.435861 2062810
3 5000 Arkansas AR 0.688531 0.282032 1108615
4 6000 California CA 0.439389 0.537068 9631972
5 8000 Colorado CO 0.559502 0.417248 2564185
6 9000 Connecticut CT 0.424576 0.557866 1623542
7 10000 Delaware DE 0.443037 0.542737 441535
8 11000 District of Columbia DC 0.053973 0.921497 280272
9 12000 Florida FL 0.633620 0.357409 9386750
10 13000 Georgia GA 0.639809 0.350515 4029564
11 15000 Hawaii HI 0.330023 0.648358 428825
12 16000 Idaho ID 0.730509 0.240739 688235
13 17000 Illinois IL 0.652193 0.327196 5374280
14 18000 Indiana IN 0.688717 0.291546 2722029
15 19000 Iowa IA 0.638318 0.344197 1542880
16 20000 Kansas KS 0.752552 0.227409 1147143
17 21000 Kentucky KY 0.740423 0.245121 1922218
18 22000 Louisiana LA 0.646492 0.339012 2027731
19 23000 Maine ME 0.486435 0.485294 741550
20 24000 Maryland MD 0.477075 0.498362 2474543
21 25000 Massachusetts MA 0.308799 0.668589 3231531
22 26000 Michigan MI 0.596681 0.387700 4789450
23 27000 Minnesota MN 0.602136 0.376106 2916404
24 28000 Mississippi MS 0.562834 0.423827 1162987
25 29000 Missouri MO 0.752084 0.232851 2775098
26 30000 Montana MT 0.689422 0.287197 483574
27 31000 Nebraska NE 0.780742 0.198169 805638
28 32000 Nevada NV 0.696978 0.277140 1122990
29 33000 New Hampshire NH 0.458564 0.523999 730628
30 34000 New Jersey NJ 0.437915 0.544962 3674893
31 35000 New Mexico NM 0.532221 0.447902 783127
32 36000 New York NY 0.508439 0.471869 7046175
33 37000 North Carolina NC 0.584579 0.403189 4629471
34 38000 North Dakota ND 0.725569 0.247834 336968
35 39000 Ohio OH 0.674596 0.310536 5325395
36 40000 Oklahoma OK 0.778398 0.202755 1451056
37 41000 Oregon OR 0.566156 0.404151 1808575
38 42000 Pennsylvania PA 0.635927 0.350978 5970107
39 44000 Rhode Island RI 0.380590 0.598526 450121
40 45000 South Carolina SC 0.535671 0.452410 2084444
41 46000 South Dakota SD 0.673640 0.305517 370047
42 47000 Tennessee TN 0.747807 0.237236 2484691
43 48000 Texas TX 0.743895 0.245202 8903237
44 49000 Utah UT 0.728772 0.241405 852461
45 50000 Vermont VT 0.351706 0.615269 291413
46 51000 Virginia VA 0.551998 0.431626 3844787
47 53000 Washington WA 0.520402 0.448434 2765627
48 54000 West Virginia WV 0.741402 0.243346 708226
49 55000 Wisconsin WI 0.564259 0.419635 2944620
50 56000 Wyoming WY 0.750912 0.217684 248742
dem_2016 gop_2016 percent_no_highschool percent_highschool ... \
0 718084 1306925 13.819302 30.800268 ...
1 0 0 7.152934 28.003729 ...
2 936250 1021154 12.860705 23.858877 ...
3 378729 677904 13.430243 34.034885 ...
4 5931283 3184721 16.692171 20.487896 ...
5 1212209 1137455 8.253678 21.368059 ...
6 884432 668266 9.369879 26.854712 ...
7 235581 185103 9.982669 31.292805 ...
8 260223 11553 9.076816 16.835115 ...
9 4485745 4605515 11.810859 28.573500 ...
10 1837300 2068623 12.855142 27.714716 ...
11 266827 128815 8.028257 27.356155 ...
12 189677 407199 9.226700 27.356852 ...
13 2977498 2118179 10.787586 25.954943 ...
14 1031953 1556220 11.181375 33.406757 ...
15 650790 798923 7.908792 30.982542 ...
16 414788 656009 9.048409 25.906137 ...
17 628834 1202942 13.738696 32.893917 ...
18 779535 1178004 14.773869 33.962753 ...
19 354873 334838 7.389770 31.473530 ...
20 1497951 873646 9.796140 24.611042 ...
21 1964768 1083069 9.242436 24.019262 ...
22 2267373 2279210 9.190457 28.873878 ...
23 1366676 1322891 6.859513 24.647589 ...
24 462001 678457 15.493731 30.438028 ...
25 1054889 1585753 10.078580 30.617037 ...
26 174521 274120 6.449651 28.832081 ...
27 273858 485819 8.595745 26.106092 ...
28 537753 511319 13.309174 28.085070 ...
29 348126 345598 6.894038 27.419645 ...
30 2021756 1535513 10.183736 27.185795 ...
31 380724 315875 14.411727 26.460430 ...
32 4143874 2640570 13.179301 25.977776 ...
33 2162074 2339603 12.219548 25.652466 ...
34 93526 216133 7.351314 26.429096 ...
35 2317001 2771984 9.621357 33.037495 ...
36 419788 947934 11.976947 31.330032 ...
37 934631 742506 9.287846 22.735300 ...
38 2844705 2912941 9.480545 34.693886 ...
39 249902 179421 11.186517 28.267363 ...
40 849469 1143611 12.488965 29.103146 ...
41 117442 227701 8.253245 30.237791 ...
42 867110 1517402 12.537143 32.088009 ...
43 3867816 4681590 16.313875 24.957039 ...
44 237241 397004 7.719078 22.836246 ...
45 178179 95053 7.327794 28.795351 ...
46 1916845 1731156 10.305691 23.953545 ...
47 1523720 1043648 8.672709 21.999466 ...
48 187457 486198 13.097753 40.320034 ...
49 1382210 1409467 7.791683 30.638811 ...
50 55949 174248 6.834035 29.073072 ...
uic_11 uic_12 target urban_pct semi_urban_pct rural_pct \
0 0.074627 0.000000 1 0.432836 0.402985 0.164179
1 0.344828 0.379310 1 0.103448 0.310345 0.586207
2 0.000000 0.000000 1 0.533333 0.466667 0.000000
3 0.026667 0.000000 1 0.266667 0.560000 0.173333
4 0.034483 0.017241 0 0.637931 0.293103 0.068966
5 0.125000 0.171875 1 0.265625 0.421875 0.312500
6 0.000000 0.000000 0 0.875000 0.125000 0.000000
7 0.000000 0.000000 0 1.000000 0.000000 0.000000
8 0.000000 0.000000 0 1.000000 0.000000 0.000000
9 0.000000 0.014925 1 0.656716 0.313433 0.029851
10 0.012579 0.025157 1 0.465409 0.396226 0.138365
11 0.000000 0.000000 0 0.600000 0.400000 0.000000
12 0.045455 0.045455 1 0.272727 0.500000 0.227273
13 0.019608 0.019608 1 0.392157 0.509804 0.098039
14 0.000000 0.000000 1 0.478261 0.467391 0.054348
15 0.050505 0.040404 1 0.212121 0.585859 0.202020
16 0.066667 0.209524 1 0.180952 0.419048 0.400000
17 0.041667 0.075000 1 0.291667 0.408333 0.300000
18 0.031250 0.000000 1 0.546875 0.375000 0.078125
19 0.187500 0.000000 1 0.312500 0.562500 0.125000
20 0.000000 0.000000 0 0.791667 0.208333 0.000000
21 0.071429 0.000000 0 0.785714 0.214286 0.000000
22 0.108434 0.024096 1 0.313253 0.518072 0.168675
23 0.034483 0.068966 1 0.310345 0.471264 0.218391
24 0.000000 0.000000 1 0.207317 0.536585 0.256098
25 0.034783 0.017391 1 0.295652 0.443478 0.260870
26 0.160714 0.285714 1 0.089286 0.392857 0.517857
27 0.086022 0.204301 1 0.139785 0.311828 0.548387
28 0.117647 0.000000 1 0.235294 0.529412 0.235294
29 0.000000 0.000000 0 0.300000 0.700000 0.000000
30 0.000000 0.000000 0 1.000000 0.000000 0.000000
31 0.090909 0.060606 1 0.212121 0.606061 0.181818
32 0.000000 0.000000 1 0.612903 0.370968 0.016129
33 0.010000 0.030000 1 0.460000 0.380000 0.160000
34 0.037736 0.264151 1 0.113208 0.188679 0.698113
35 0.000000 0.000000 1 0.431818 0.545455 0.022727
36 0.012987 0.012987 1 0.233766 0.558442 0.207792
37 0.055556 0.083333 1 0.361111 0.500000 0.138889
38 0.000000 0.000000 1 0.552239 0.388060 0.059701
39 0.000000 0.000000 0 1.000000 0.000000 0.000000
40 0.000000 0.000000 1 0.565217 0.413043 0.021739
41 0.030303 0.166667 1 0.121212 0.242424 0.636364
42 0.000000 0.010526 1 0.442105 0.389474 0.168421
43 0.051181 0.062992 1 0.322835 0.484252 0.192913
44 0.103448 0.103448 1 0.344828 0.482759 0.172414
45 0.071429 0.000000 0 0.214286 0.571429 0.214286
46 0.007519 0.037594 1 0.601504 0.240602 0.157895
47 0.000000 0.051282 1 0.538462 0.333333 0.128205
48 0.000000 0.036364 1 0.381818 0.418182 0.200000
49 0.027778 0.069444 1 0.361111 0.458333 0.180556
50 0.130435 0.086957 1 0.086957 0.739130 0.173913
large_metro_uic small_metro_uic rural_uic margin
0 0.432836 0.447761 0.119403 30.471133
1 0.103448 0.034483 0.862069 7.688521
2 0.533333 0.333333 0.133333 11.286270
3 0.266667 0.360000 0.373333 40.649831
4 0.637931 0.258621 0.103448 -9.767912
5 0.265625 0.187500 0.546875 14.225424
6 0.875000 0.125000 0.000000 -13.329063
7 1.000000 0.000000 0.000000 -9.969955
8 1.000000 0.000000 0.000000 -86.752373
9 0.656716 0.328358 0.014925 27.621053
10 0.465409 0.364780 0.169811 28.929338
11 0.600000 0.000000 0.400000 -31.833484
12 0.272727 0.409091 0.318182 48.976967
13 0.392157 0.362745 0.245098 32.499742
14 0.478261 0.445652 0.076087 39.717157
15 0.212121 0.383838 0.404040 29.412134
16 0.180952 0.200000 0.619048 52.514368
17 0.291667 0.266667 0.441667 49.530237
18 0.546875 0.328125 0.125000 30.747987
19 0.312500 0.500000 0.187500 0.114054
20 0.791667 0.208333 0.000000 -2.128640
21 0.785714 0.071429 0.142857 -35.979075
22 0.313253 0.204819 0.481928 20.898059
23 0.310345 0.390805 0.298851 22.602929
24 0.207317 0.365854 0.426829 13.900687
25 0.295652 0.356522 0.347826 51.923302
26 0.089286 0.232143 0.678571 40.222544
27 0.139785 0.193548 0.666667 58.257352
28 0.235294 0.294118 0.470588 41.983823
29 0.300000 0.400000 0.300000 -6.543491
30 1.000000 0.000000 0.000000 -10.704621
31 0.212121 0.242424 0.545455 8.431905
32 0.612903 0.322581 0.064516 3.657005
33 0.460000 0.400000 0.140000 18.138959
34 0.113208 0.207547 0.679245 47.773558
35 0.431818 0.500000 0.068182 36.405993
36 0.233766 0.324675 0.441558 57.564274
37 0.361111 0.277778 0.361111 16.200463
38 0.552239 0.343284 0.104478 28.494876
39 1.000000 0.000000 0.000000 -21.793601
40 0.565217 0.413043 0.021739 8.326079
41 0.121212 0.212121 0.666667 36.812326
42 0.442105 0.410526 0.147368 51.057099
43 0.322835 0.385827 0.291339 49.869359
44 0.344828 0.206897 0.448276 48.736691
45 0.214286 0.357143 0.428571 -26.356277
46 0.601504 0.285714 0.112782 12.037252
47 0.538462 0.333333 0.128205 7.196842
48 0.381818 0.400000 0.218182 49.805581
49 0.361111 0.486111 0.152778 14.462402
50 0.086957 0.086957 0.826087 53.322846
[51 rows x 46 columns]
On voit maintenant que c'est beaucoup plus clair.
Voici les variables à utiliser pour notre modèle, basées sur l'analyse des corrélations:
Pertinent:
- percent_college (0.57) - Forte corrélation positive avec la cible
- small_metro_uic (0.57) - Forte corrélation positive avec la cible
- semi_urban_pct (0.52) - Bonne corrélation positive avec la cible
- median_household_income (-0.66) - Forte corrélation négative avec la cible
- percent_poverty (0.38) - Corrélation modérée positive avec la cible
Possible:
- rural_pct (0.46) - Bonne corrélation positive avec la cible
- percent_bachelor (-0.60) - Forte corrélation négative avec la cible
Cette sélection nous permettra de capturer les différentes dimensions qui influencent notre variable cible tout en limitant la multicolinéarité. Elle combine des indicateurs socio-économiques (revenu, éducation, pauvreté) et des indicateurs géographiques (degré d'urbanisation).
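Cette sélection manuelle peut être reproduite automatiquement à partir de target_corr calculé plus haut. Esquisse indicative, avec un seuil de 0.35 choisi arbitrairement pour l'illustration :
# Sélection automatique : variables dont la corrélation absolue avec la cible dépasse le seuil
seuil = 0.35
candidates = target_corr.drop('target')
selection_auto = candidates[candidates.abs() > seuil].index.tolist()
print("Variables retenues automatiquement :", selection_auto)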
5. MODELISATION¶
5.1. CheckUps¶
Vérification de la répartition des votes
df["target"].value_counts(normalize=True)
target
1    0.784314
0    0.215686
Name: proportion, dtype: float64
Observation : Le dataset est très déséquilibré :
- 78.4% des états ont voté Républicain (1)
- 21.6% ont voté Démocrate (0)
Cela peut poser problème, car un modèle de classification risque de favoriser la classe majoritaire et de mal prédire les Démocrates.¶
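Deux parades simples peuvent être esquissées dès maintenant : stratifier la séparation train/test pour préserver la proportion 78/22 dans les deux sous-ensembles, et pondérer les classes. Esquisse indicative sur les seules variables numériques retenues (X_final, avec encodage, n'est construit qu'en 5.3) :
# Esquisse : split stratifié + pondération de la classe minoritaire
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

y_cible = df['target']
X_num = df[['percent_college', 'semi_urban_pct', 'median_household_income',
            'percent_poverty', 'rural_pct', 'percent_bachelor']]
# stratify=y_cible garantit la même proportion de classes en train et en test
X_tr, X_te, y_tr, y_te = train_test_split(X_num, y_cible, test_size=0.3,
                                          stratify=y_cible, random_state=42)
# class_weight='balanced' augmente le poids des erreurs sur la classe minoritaire (Démocrates)
clf = LogisticRegression(max_iter=5000, class_weight='balanced', random_state=42).fit(X_tr, y_tr)
print("Répartition test :", y_te.value_counts(normalize=True).round(2).to_dict())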
5.2. Séparation features et target¶
df.columns
Index(['id', 'state_name', 'state_code', 'per_gop', 'per_dem', 'total_2016',
'dem_2016', 'gop_2016', 'percent_no_highschool', 'percent_highschool',
'percent_college', 'percent_bachelor', 'percent_poverty',
'median_household_income', 'unemployment_rate', 'Employed_2019',
'Unemployed_2019', 'ruc_1', 'ruc_2', 'ruc_3', 'ruc_4', 'ruc_5', 'ruc_6',
'ruc_7', 'ruc_8', 'ruc_9', 'uic_1', 'uic_2', 'uic_3', 'uic_4', 'uic_5',
'uic_6', 'uic_7', 'uic_8', 'uic_9', 'uic_10', 'uic_11', 'uic_12',
'target', 'urban_pct', 'semi_urban_pct', 'rural_pct', 'large_metro_uic',
'small_metro_uic', 'rural_uic', 'margin'],
dtype='object')
# Liste des colonnes pertinentes
selected_columns = [
"id","state_name", "state_code",
"percent_college", "semi_urban_pct", "median_household_income",
"percent_poverty", "rural_pct", "percent_bachelor"
]
selected_features = [col for col in selected_columns if col in df.columns]
# Création des jeux de données
X = df[selected_features]
y = df["target"]
5.3. Encodage des variables catégorielles¶
categorical_features = ["id","state_name", "state_code"]
encoder = OneHotEncoder(drop="first", sparse_output=False)
X_encoded = pd.DataFrame(encoder.fit_transform(X[categorical_features]))
X_encoded.columns = encoder.get_feature_names_out(categorical_features)
X = X.drop(columns=categorical_features)
# Concaténation des features encodées avec les autres features
X_final = pd.concat([X, X_encoded], axis=1)
X_final.head()
| percent_college | semi_urban_pct | median_household_income | percent_poverty | rural_pct | percent_bachelor | id_2000 | id_4000 | id_5000 | id_6000 | ... | state_code_SD | state_code_TN | state_code_TX | state_code_UT | state_code_VA | state_code_VT | state_code_WA | state_code_WI | state_code_WV | state_code_WY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29.912098 | 0.402985 | 51771 | 15.6 | 0.164179 | 25.468332 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 35.292122 | 0.310345 | 77203 | 10.2 | 0.586207 | 29.551214 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 33.813610 | 0.466667 | 62027 | 13.5 | 0.000000 | 29.466806 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 29.507084 | 0.560000 | 49020 | 16.0 | 0.173333 | 23.027790 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 28.893970 | 0.293103 | 80423 | 11.8 | 0.068966 | 33.925964 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 156 columns
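Un point de vigilance avant la modélisation : id, state_name et state_code identifient chacun de façon unique chaque ligne ; leur encodage one-hot donne donc au modèle un indicateur par état, qu'il peut mémoriser (ce qui pourrait expliquer en partie les scores d'entraînement parfaits observés plus loin). Une variante prudente, esquissée ci-dessous, consiste à écarter ces colonnes :
# Esquisse : variante de X_final sans les indicateurs d'état, pour limiter la mémorisation
prefixes_id = ('id_', 'state_name_', 'state_code_')
X_features_only = X_final[[c for c in X_final.columns if not c.startswith(prefixes_id)]]
print(X_features_only.columns.tolist())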
5.4. Modélisation proprement dite¶
def model_comparison(X, y):
# Séparation des données en train et test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# List of models to evaluate
models = {
'Logistic Regression': LogisticRegression(max_iter=5000, random_state=42),
'Random Forest': RandomForestClassifier(random_state=42),
'XGBoost': xgb.XGBClassifier(random_state=42)
}
# Paramètres à tester dans GridSearchCV pour chaque modèle
param_grids = {
'Logistic Regression': {
'logreg__C': [0.1, 1, 10],
'logreg__solver': ['lbfgs', 'liblinear']
},
'Random Forest': {
'rf__n_estimators': [100, 200, 500],
'rf__max_depth': [10, 20, None],
'rf__min_samples_split': [2, 5],
'rf__min_samples_leaf': [1, 2]
},
'XGBoost': {
'xgb__learning_rate': [0.01, 0.1, 0.3],
'xgb__max_depth': [3, 6, 10],
'xgb__n_estimators': [50, 100, 200],
'xgb__subsample': [0.8, 1.0],
'xgb__colsample_bytree': [0.8, 1.0]
}
}
results = []
for model_name, model in models.items():
# Création d'un pipeline pour chaque modèle
if model_name == 'Logistic Regression':
pipeline = Pipeline([('scaler', StandardScaler()), ('logreg', model)])
elif model_name == 'Random Forest':
pipeline = Pipeline([('scaler', StandardScaler()), ('rf', model)])
else: # XGBoost
pipeline = Pipeline([('scaler', StandardScaler()), ('xgb', model)])
# GridSearchCV
grid_search = GridSearchCV(pipeline, param_grids[model_name], cv=5, scoring='f1', n_jobs=-1)
# Entraînement et évaluation
grid_search.fit(X_train, y_train)
y_pred_train = grid_search.predict(X_train)
y_pred_test = grid_search.predict(X_test)
# Collecte des résultats
results.append({
'Model': model_name,
'Best Params': grid_search.best_params_,
'Train F1-Score': classification_report(y_train, y_pred_train, output_dict=True)['1']['f1-score'],
'Test F1-Score': classification_report(y_test, y_pred_test, output_dict=True)['1']['f1-score'],
'Train Accuracy': classification_report(y_train, y_pred_train, output_dict=True)['accuracy'],
'Test Accuracy': classification_report(y_test, y_pred_test, output_dict=True)['accuracy'],
'Train Recall': classification_report(y_train, y_pred_train, output_dict=True)['1']['recall'],
'Test Recall': classification_report(y_test, y_pred_test, output_dict=True)['1']['recall'],
'Train Precision': classification_report(y_train, y_pred_train, output_dict=True)['1']['precision'],
'Test Precision': classification_report(y_test, y_pred_test, output_dict=True)['1']['precision']
})
# Conversion des résultats en DataFrame
results_df = pd.DataFrame(results)
return results_df
def plot_model_metrics(results_df):
"""
Function to plot test metrics from model comparison results
"""
models = results_df['Model']
test_accuracy = results_df['Test Accuracy']
test_f1 = results_df['Test F1-Score']
test_recall = results_df['Test Recall']
# Couleurs harmonieuses et moins contrastées
colors = ['#8ecae6', '#219ebc', '#126782']
x = np.arange(len(models))
width = 0.25 # Plus étroit pour accommoder 3 barres
plt.figure(figsize=(12, 7))
# Création des barres avec les nouvelles couleurs
plt.bar(x - width, test_accuracy, width, label='Test Accuracy', color=colors[0], alpha=0.8)
plt.bar(x, test_f1, width, label='Test F1-Score', color=colors[1], alpha=0.8)
plt.bar(x + width, test_recall, width, label='Test Recall', color=colors[2], alpha=0.8)
plt.xlabel('Modèles', fontsize=12)
plt.ylabel('Score', fontsize=12)
plt.title('Comparaison des métriques de test par modèle', fontsize=14)
plt.xticks(x, models, rotation=15, ha='right')
plt.ylim(0, 1.1)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.3)
# Ajout des valeurs sur les barres
for i, v in enumerate(test_accuracy):
plt.text(i - width, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
for i, v in enumerate(test_f1):
plt.text(i, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
for i, v in enumerate(test_recall):
plt.text(i + width, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
plt.tight_layout()
plt.show()
def plot_accuracy_comparison(results_df):
"""
Function to compare train and test accuracy for each model
"""
models = results_df['Model']
train_accuracy = results_df['Train Accuracy']
test_accuracy = results_df['Test Accuracy']
# Couleurs harmonieuses
colors = ['#219ebc', '#fb8500']
x = np.arange(len(models))
width = 0.35
plt.figure(figsize=(12, 7))
plt.bar(x - width/2, train_accuracy, width, label='Train Accuracy', color=colors[0], alpha=0.8)
plt.bar(x + width/2, test_accuracy, width, label='Test Accuracy', color=colors[1], alpha=0.8)
plt.xlabel('Modèles', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Comparaison de l\'accuracy en train et test par modèle', fontsize=14)
plt.xticks(x, models, rotation=15, ha='right')
plt.ylim(0, 1.1)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.3)
# Ajout des valeurs sur les barres
for i, v in enumerate(train_accuracy):
plt.text(i - width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
for i, v in enumerate(test_accuracy):
plt.text(i + width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
# Calcul et affichage de la différence entre train et test (overfitting)
for i in range(len(models)):
diff = train_accuracy[i] - test_accuracy[i]
plt.text(i, max(train_accuracy[i], test_accuracy[i]) + 0.08,
f'Diff: {diff:.2f}', ha='center', fontsize=10, color='#d62828')
plt.tight_layout()
plt.show()
def plot_recall_comparison(results_df):
"""
Function to compare train and test recall for each model
"""
models = results_df['Model']
train_recall = results_df['Train Recall']
test_recall = results_df['Test Recall']
# Couleurs harmonieuses
colors = ['#4cc9f0', '#f72585']
x = np.arange(len(models))
width = 0.35
plt.figure(figsize=(12, 7))
plt.bar(x - width/2, train_recall, width, label='Train Recall', color=colors[0], alpha=0.8)
plt.bar(x + width/2, test_recall, width, label='Test Recall', color=colors[1], alpha=0.8)
plt.xlabel('Modèles', fontsize=12)
plt.ylabel('Recall', fontsize=12)
plt.title('Comparaison du recall en train et test par modèle', fontsize=14)
plt.xticks(x, models, rotation=15, ha='right')
plt.ylim(0, 1.1)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.3)
# Ajout des valeurs sur les barres
for i, v in enumerate(train_recall):
plt.text(i - width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
for i, v in enumerate(test_recall):
plt.text(i + width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
# Calcul et affichage de la différence entre train et test
for i in range(len(models)):
diff = train_recall[i] - test_recall[i]
plt.text(i, max(train_recall[i], test_recall[i]) + 0.08,
f'Diff: {diff:.2f}', ha='center', fontsize=10, color='#d62828')
plt.tight_layout()
plt.show()
def plot_f1_comparison(results_df):
"""
Function to compare train and test F1-Score for each model
"""
models = results_df['Model']
train_f1 = results_df['Train F1-Score']
test_f1 = results_df['Test F1-Score']
# Couleurs harmonieuses
colors = ['#2a9d8f', '#e76f51']
x = np.arange(len(models))
width = 0.35
plt.figure(figsize=(12, 7))
plt.bar(x - width/2, train_f1, width, label='Train F1-Score', color=colors[0], alpha=0.8)
plt.bar(x + width/2, test_f1, width, label='Test F1-Score', color=colors[1], alpha=0.8)
plt.xlabel('Modèles', fontsize=12)
plt.ylabel('F1-Score', fontsize=12)
plt.title('Comparaison du F1-Score en train et test par modèle', fontsize=14)
plt.xticks(x, models, rotation=15, ha='right')
plt.ylim(0, 1.1)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.3)
# Ajout des valeurs sur les barres
for i, v in enumerate(train_f1):
plt.text(i - width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
for i, v in enumerate(test_f1):
plt.text(i + width/2, v + 0.02, f'{v:.2f}', ha='center', fontsize=9)
# Calcul et affichage de la différence entre train et test
for i in range(len(models)):
diff = train_f1[i] - test_f1[i]
plt.text(i, max(train_f1[i], test_f1[i]) + 0.08,
f'Diff: {diff:.2f}', ha='center', fontsize=10, color='#d62828')
plt.tight_layout()
plt.show()
# Entrainement des modèles
results = model_comparison(X_final, y)
results
| Model | Best Params | Train F1-Score | Test F1-Score | Train Accuracy | Test Accuracy | Train Recall | Test Recall | Train Precision | Test Precision | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | {'logreg__C': 0.1, 'logreg__solver': 'liblinear'} | 1.0 | 0.909091 | 1.0 | 0.8750 | 1.0 | 0.833333 | 1.0 | 1.000000 |
| 1 | Random Forest | {'rf__max_depth': 10, 'rf__min_samples_leaf': ... | 1.0 | 0.960000 | 1.0 | 0.9375 | 1.0 | 1.000000 | 1.0 | 0.923077 |
| 2 | XGBoost | {'xgb__colsample_bytree': 0.8, 'xgb__learning_... | 1.0 | 0.869565 | 1.0 | 0.8125 | 1.0 | 0.833333 | 1.0 | 0.909091 |
6. EVALUATION¶
# Affichage des resultats
plot_model_metrics(results)
Analyse comparative des modèles
- Random Forest se démarque avec des performances supérieures sur toutes les métriques (Recall: 1.00, F1-Score: 0.96, Accuracy: 0.94).
- Logistic Regression présente des résultats intermédiaires, tandis que
- XGBoost affiche des performances légèrement moindres.
Pour ce problème, Random Forest offre le meilleur équilibre entre les différentes métriques.
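Pour compléter ces métriques agrégées, on peut visualiser la matrice de confusion du meilleur modèle sur le jeu de test. Esquisse indicative : on ré-entraîne ici un Random Forest avec ses hyperparamètres par défaut (et non ceux issus du GridSearchCV) sur le même découpage supposé (random_state=42) :
# Matrice de confusion sur le jeu de test (même split que dans model_comparison)
from sklearn.metrics import ConfusionMatrixDisplay

X_tr, X_te, y_tr, y_te = train_test_split(X_final, y, test_size=0.3, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
ConfusionMatrixDisplay.from_estimator(rf, X_te, y_te,
                                      display_labels=['Démocrates', 'Républicains'])
plt.title('Matrice de confusion - Random Forest (jeu de test)')
plt.show()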
plot_accuracy_comparison(results)
plot_recall_comparison(results)
plot_f1_comparison(results)
Analyse de la capacité de généralisation des modèles¶
Nous avons entraîné trois modèles de classification différents (Régression Logistique, Random Forest et XGBoost) et comparé leurs performances sur les jeux de données d'entraînement et de test. Cette comparaison nous permet d'évaluer leur capacité de généralisation.
Résultats observés¶
Accuracy¶
- Tous les modèles atteignent une accuracy parfaite (1.00) sur les données d'entraînement
- Sur les données de test:
- Random Forest: 0.94 (Diff: 0.06)
- Logistic Regression: 0.88 (Diff: 0.12)
- XGBoost: 0.81 (Diff: 0.19)
Recall¶
- Tous les modèles obtiennent un recall parfait (1.00) sur les données d'entraînement
- Sur les données de test:
- Random Forest: 1.00 (Diff: 0.00)
- Logistic Regression: 0.83 (Diff: 0.17)
- XGBoost: 0.83 (Diff: 0.17)
F1-Score¶
- Tous les modèles atteignent un F1-Score parfait (1.00) sur les données d'entraînement
- Sur les données de test:
- Random Forest: 0.96 (Diff: 0.04)
- Logistic Regression: 0.91 (Diff: 0.09)
- XGBoost: 0.87 (Diff: 0.13)
Interprétation¶
Les graphiques de comparaison train/test nous permettent d'identifier clairement la capacité de généralisation de chaque modèle:
Random Forest présente la meilleure capacité de généralisation:
- Performances élevées sur toutes les métriques en test
- Écarts minimaux entre train et test
- Capacité exceptionnelle à maintenir un recall parfait sur les données de test
Logistic Regression montre une généralisation satisfaisante:
- Performances correctes en test
- Écarts modérés entre train et test
- Bon équilibre entre precision et recall (bon F1-Score)
XGBoost présente des signes de surapprentissage:
- Performances plus faibles en test comparées aux autres modèles
- Écarts plus importants entre train et test
- Pourrait bénéficier d'une meilleure régularisation
On note que le recall, le F1-score et l'accuracy en train ont tous atteint des scores parfaits (1.00), ce qui peut être problématique et faire penser à un surapprentissage (overfitting).
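Avec seulement 51 observations (dont 16 en test), ces écarts restent statistiquement fragiles. Une validation croisée stratifiée donnerait une estimation plus stable du F1-score ; esquisse indicative, avec le Random Forest par défaut :
# Esquisse : validation croisée stratifiée (5 plis)
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X_final, y,
                         cv=cv, scoring='f1')
print(f"F1 moyen : {scores.mean():.2f} (écart-type : {scores.std():.2f})")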
Conclusion¶
La visualisation systématique des performances sur les jeux d'entraînement et de test nous permet de justifier objectivement le choix du modèle Random Forest comme étant le plus fiable pour la généralisation à de nouvelles données. Ce modèle offre le meilleur compromis entre performance prédictive et stabilité entre les différents jeux de données.
Procédons à l'analyse des variables pour identifier les plus importantes¶
def analyze_feature_importance(X, y, model_results):
"""
Analyse l'importance des variables de façon globale et locale avec SHAP
Supporte les modèles binaires et multi-classes
Parameters:
-----------
X : DataFrame
Les features utilisées pour l'entraînement
y : Series
La variable cible
model_results : DataFrame
Résultats de la fonction model_comparison
Returns:
--------
dict: Dictionnaire contenant les résultats de l'analyse
"""
# Récupération du meilleur modèle (Random Forest)
best_params = model_results.loc[model_results['Model'] == 'Random Forest', 'Best Params'].values[0]
# Extraction des paramètres optimaux pour Random Forest
params = {}
for key, value in best_params.items():
if key.startswith('rf__'):
params[key[4:]] = value # Enlever le préfixe 'rf__'
# Création et entraînement du modèle avec les meilleurs paramètres
best_rf = RandomForestClassifier(**params, random_state=42)
# Création et entraînement du modèle de régression logistique pour comparer
best_params_logreg = model_results.loc[model_results['Model'] == 'Logistic Regression', 'Best Params'].values[0]
params_logreg = {}
for key, value in best_params_logreg.items():
if key.startswith('logreg__'):
params_logreg[key[8:]] = value # Enlever le préfixe 'logreg__'
best_logreg = LogisticRegression(**params_logreg, random_state=42)
# Standardisation des données pour la régression logistique
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Entraînement des modèles
best_rf.fit(X, y)
best_logreg.fit(X_scaled, y)
# 1. Analyse globale - Feature Importance pour Random Forest
plt.figure(figsize=(10, 6))
feature_importances = pd.DataFrame(
{'feature': X.columns, 'importance': best_rf.feature_importances_}
).sort_values('importance', ascending=False)
sns.barplot(x='importance', y='feature', data=feature_importances, palette='Blues_d')
plt.title('Importance des variables - Random Forest', fontsize=14)
plt.xlabel('Importance')
plt.ylabel('Variables')
plt.tight_layout()
plt.show()
# 2. Analyse globale - Coefficients pour Régression Logistique
plt.figure(figsize=(10, 6))
coefs = pd.DataFrame(
{'feature': X.columns, 'coefficient': best_logreg.coef_[0]}
).sort_values('coefficient', ascending=False)
# Utiliser une palette de couleurs différentes pour les coefficients positifs et négatifs
colors = ['#FF9999' if c < 0 else '#66B2FF' for c in coefs['coefficient']]
sns.barplot(x='coefficient', y='feature', data=coefs, palette=colors)
plt.title('Coefficients - Régression Logistique', fontsize=14)
plt.axvline(x=0, color='gray', linestyle='--')
plt.xlabel('Coefficient')
plt.ylabel('Variables')
plt.tight_layout()
plt.show()
# 3. Analyse locale - SHAP pour Random Forest
# Échantillonnage si le dataset est très grand
sample_size = min(100, len(X))
X_sample = X.sample(sample_size, random_state=42)
# Création de l'explainer SHAP
explainer = shap.TreeExplainer(best_rf)
# Pour les graphiques summary plot, on peut encore utiliser shap_values
shap_values = explainer.shap_values(X_sample)
# Vérifier si nous avons un modèle binaire
is_binary = isinstance(shap_values, list) or (isinstance(shap_values, np.ndarray) and shap_values.ndim > 2)
# Résumé des valeurs SHAP - utiliser la classe 1 pour classification binaire
plt.figure(figsize=(10, 8))
if is_binary:
# Pour les modèles binaires, utiliser l'index 1 (classe positive)
shap.summary_plot(shap_values[1] if isinstance(shap_values, list) else shap_values[:, :, 1],
X_sample, plot_type="bar", show=False)
else:
shap.summary_plot(shap_values, X_sample, plot_type="bar", show=False)
plt.title('Résumé des valeurs SHAP - Random Forest', fontsize=14)
plt.tight_layout()
plt.show()
# Détail des valeurs SHAP
plt.figure(figsize=(12, 10))
if is_binary:
# Pour les modèles binaires, utiliser l'index 1 (classe positive)
shap.summary_plot(shap_values[1] if isinstance(shap_values, list) else shap_values[:, :, 1],
X_sample, show=False)
else:
shap.summary_plot(shap_values, X_sample, show=False)
plt.title('Impact des variables sur les prédictions (SHAP) - Random Forest', fontsize=14)
plt.tight_layout()
plt.show()
# 4. Analyse d'un exemple spécifique
try:
# Prendre un exemple au hasard
example_idx = np.random.randint(0, len(X_sample))
print(f"Analyse d'un exemple spécifique (indice {example_idx}):")
# Afficher les valeurs de l'exemple
example_data = X_sample.iloc[example_idx]
print("\nValeurs de l'exemple:")
for feature, value in example_data.items():
print(f" {feature}: {value}")
# Tenter d'utiliser le SHAP Dependence plot au lieu du waterfall
# qui est plus robuste pour les modèles binaires
plt.figure(figsize=(10, 6))
# Identifier la feature la plus importante
if is_binary:
feature_importance = np.abs(shap_values[1]).mean(0) if isinstance(shap_values, list) else np.abs(shap_values[:, :, 1]).mean(0)
else:
feature_importance = np.abs(shap_values).mean(0)
most_important_idx = np.argmax(feature_importance)
most_important_feature = X.columns[most_important_idx]
# Créer un dependence plot pour la feature la plus importante
if is_binary:
shap_values_to_plot = shap_values[1] if isinstance(shap_values, list) else shap_values[:, :, 1]
else:
shap_values_to_plot = shap_values
shap.dependence_plot(
most_important_idx,
shap_values_to_plot,
X_sample,
feature_names=X.columns,
show=False
)
plt.title(f'Dependence plot pour {most_important_feature}', fontsize=14)
plt.tight_layout()
plt.show()
# Créer une fonction pour visualiser les contributions des features pour un exemple
def plot_feature_contributions(example_idx, shap_values, X_sample, is_binary=False):
# Extraire les valeurs SHAP pour l'exemple
if is_binary:
shap_vals = shap_values[1][example_idx] if isinstance(shap_values, list) else shap_values[example_idx, :, 1]
else:
shap_vals = shap_values[example_idx]
# Créer un DataFrame pour la visualisation
contrib_df = pd.DataFrame({
'Feature': X_sample.columns,
'Contribution': shap_vals
}).sort_values('Contribution', ascending=False)
# Visualiser
plt.figure(figsize=(10, 6))
bars = plt.barh(contrib_df['Feature'], contrib_df['Contribution'])
# Colorer les barres en fonction de leur contribution (positive/négative)
for i, bar in enumerate(bars):
if contrib_df['Contribution'].iloc[i] > 0:
bar.set_color('#66B2FF') # Bleu pour contributions positives
else:
bar.set_color('#FF9999') # Rouge pour contributions négatives
plt.axvline(x=0, color='gray', linestyle='--')
plt.title(f'Contributions des variables pour l\'exemple #{example_idx}', fontsize=14)
plt.xlabel('Impact sur la prédiction')
plt.tight_layout()
plt.show()
# Afficher les contributions numériques
print("\nContributions des variables (top 5):")
for _, row in contrib_df.head(5).iterrows():
print(f" {row['Feature']}: {row['Contribution']:.4f}")
print("\nContributions des variables (bottom 5):")
for _, row in contrib_df.tail(5).iterrows():
print(f" {row['Feature']}: {row['Contribution']:.4f}")
# Calculer la prédiction
if is_binary:
    # expected_value peut être une liste ou un ndarray à 2 entrées selon la version de SHAP ;
    # on force un scalaire pour éviter l'erreur de formatage numpy observée plus bas
    ev = explainer.expected_value
    expected_value = float(ev[1]) if isinstance(ev, (list, np.ndarray)) else float(ev)
else:
    expected_value = explainer.expected_value
prediction = expected_value + np.sum(shap_vals)
print(f"\nValeur de base (expected value): {expected_value:.4f}")
print(f"Somme des contributions: {np.sum(shap_vals):.4f}")
print(f"Prédiction finale: {prediction:.4f}")
if is_binary:
print(f"Probabilité: {1 / (1 + np.exp(-prediction)):.4f}")
# Utiliser notre fonction personnalisée
plot_feature_contributions(example_idx, shap_values, X_sample, is_binary)
    except Exception as e:
        print(f"Error while analyzing the example: {e}")

    print("\nFeature importance analysis complete.")
    return {
        "feature_importance_rf": feature_importances,
        "coefficients_logreg": coefs,
        "shap_explainer": explainer,
        "shap_values": shap_values,
        "X_sample": X_sample
    }
importance_analysis = analyze_feature_importance(X, y, results)
importance_analysis
Analysis of a specific example (index 27):

Example values:
  percent_college: 29.9120979309082
  semi_urban_pct: 0.4029850746268656
  median_household_income: 51771.0
  percent_poverty: 15.6
  rural_pct: 0.16417910447761197
  percent_bachelor: 25.46833229064941

Feature contributions (top 5):
  rural_pct: 0.0588
  median_household_income: 0.0429
  percent_bachelor: 0.0330
  semi_urban_pct: 0.0325
  percent_college: 0.0297

Feature contributions (bottom 5):
  median_household_income: 0.0429
  percent_bachelor: 0.0330
  semi_urban_pct: 0.0325
  percent_college: 0.0297
  percent_poverty: 0.0120

Error while analyzing the example: unsupported format string passed to numpy.ndarray.__format__

Feature importance analysis complete.

The TypeError comes from formatting explainer.expected_value, which recent shap releases return as a NumPy array for binary classifiers: the :.4f specifier cannot be applied to an ndarray, and converting to a plain float before formatting avoids it. Note also that with only six features, the top-5 and bottom-5 lists necessarily overlap.
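A standalone sketch of the robust pattern, which also verifies SHAP additivity against the model. Here importance_analysis is the dict returned above, and rf_model stands for the fitted Random Forest (a hypothetical name for this sketch):

import numpy as np

explainer = importance_analysis["shap_explainer"]
shap_values = importance_analysis["shap_values"]  # shape (n_samples, n_features, 2)
X_sample = importance_analysis["X_sample"]

# f"{ev:.4f}" raises "unsupported format string passed to numpy.ndarray.__format__"
# when ev is an ndarray; select the class-1 entry and convert it to a plain float.
ev = np.asarray(explainer.expected_value).ravel()
base = float(ev[-1])
print(f"Base value (class 1): {base:.4f}")

# Additivity check: base value + sum of per-feature contributions should match
# the class-1 probability for each sample (TreeExplainer on a scikit-learn
# forest explains predict_proba directly).
reconstructed = base + shap_values[:, :, 1].sum(axis=1)
predicted = rf_model.predict_proba(X_sample)[:, 1]  # rf_model: hypothetical name
print(np.allclose(reconstructed, predicted))  # expected: True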
Returned dictionary, as echoed by the cell above (long entries truncated):

{'feature_importance_rf':                    feature  importance
 4                rural_pct    0.296621
 2  median_household_income    0.192839
 0          percent_college    0.168414
 1           semi_urban_pct    0.164082
 5         percent_bachelor    0.118620
 3          percent_poverty    0.059424,
 'coefficients_logreg':                    feature  coefficient
 0          percent_college     0.338587
 4                rural_pct     0.291271
 1           semi_urban_pct     0.283146
 3          percent_poverty     0.198093
 5         percent_bachelor    -0.249499
 2  median_household_income    -0.361014,
 'shap_explainer': <shap.explainers._tree.TreeExplainer at 0x7fdad4ea3a30>,
 'shap_values': array([[[-5.69003582e-03,  5.69003582e-03],
         [-3.70237548e-02,  3.70237548e-02],
         ...,
         [-5.03508055e-02,  5.03508055e-02]]]),
 'X_sample': DataFrame of 51 rows x 6 columns (percent_college, semi_urban_pct,
             median_household_income, percent_poverty, rural_pct, percent_bachelor)}

The shap_values array has shape (51, 6, 2); for every entry, the class-0 and class-1 contributions are exact opposites, as expected for a binary classifier whose class probabilities sum to 1.
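Gini importances are non-negative shares while the logistic coefficients are signed (their scale depends on the preprocessing applied upstream), so the two tables are easier to read side by side once merged on the feature name. A minimal sketch, assuming the returned dict is still bound to importance_analysis:

# Side-by-side view of the two rankings returned above
comparison = (
    importance_analysis["feature_importance_rf"]        # columns: feature, importance
    .merge(importance_analysis["coefficients_logreg"],  # columns: feature, coefficient
           on="feature")
    .sort_values("importance", ascending=False)
)
print(comparison.to_string(index=False))

Both models put rural_pct near the top, but they read median_household_income differently: second-ranked for the forest, largest negative weight for the logistic regression.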
7. Export to HTML¶
!jupyter nbconvert notebook_states.ipynb --to html
[NbConvertApp] Converting notebook notebook_states.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 28 image(s).
[NbConvertApp] Writing 49395640 bytes to notebook_states.html
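For a report-style export that hides the code cells while keeping their outputs, nbconvert's standard --no-input flag can be used (a possible variant, not the command run above):

!jupyter nbconvert notebook_states.ipynb --to html --no-input --output notebook_states_report

The warning about missing alternative text refers to the 28 generated figures, which carry no alt attribute in the exported HTML.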